pith. machine review for the scientific record.

arxiv: 2507.15003 · v1 · submitted 2025-07-20 · 💻 cs.SE · cs.AI · cs.CE · cs.LG


The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 17:33 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CE · cs.LG
keywords AI coding agents · autonomous teammates · pull request dataset · software engineering · human-AI collaboration · empirical study · AIDev · SE 3.0

The pith

AIDev supplies the first large-scale dataset of over 456,000 real pull requests from five autonomous coding agents to ground the study of AI teammates in software development.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AIDev as a new public dataset drawn from over 456,000 pull requests made by five major coding agents across 61,000 repositories. It positions this collection as the empirical base needed to move beyond theory or synthetic tests when examining how autonomous agents initiate, review, and merge code alongside human developers. A reader would care because the data already shows measurable patterns such as faster submissions paired with lower acceptance rates and simpler structural changes, which point to concrete trust and utility gaps in current human-AI workflows.

Core claim

Autonomous coding agents are now operating at scale in open repositories, and the AIDev dataset of 456,000 pull requests supplies the structured, real-world traces required to study their behavior, including metadata on authorship, review timelines, code changes, and integration outcomes. This resource directly supports research on benchmarking, agent readiness, optimization, collaboration modeling, and AI governance without relying on synthetic benchmarks such as SWE-bench.

What carries the argument

AIDev, the dataset of pull requests with rich metadata on authorship, review timelines, code changes, and integration outcomes that functions as the empirical foundation for analyzing agent behavior in the wild.
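
As a concrete illustration of what such a PR table enables, here is a minimal sketch that computes per-agent acceptance rates. The column names and values are illustrative assumptions, not the documented AIDev schema (which, as the referee notes below, is only described in the released repository).

```python
import pandas as pd

# Toy stand-in for an AIDev-style PR table; the columns below are
# illustrative assumptions, not the actual AIDev schema.
prs = pd.DataFrame({
    "agent": ["Devin", "Devin", "OpenAI Codex", "GitHub Copilot", "GitHub Copilot"],
    "state": ["merged", "closed", "merged", "merged", "closed"],
    "hours_to_close": [3.2, 11.0, 1.5, 6.4, 48.0],
})

# Acceptance rate per agent: the share of its PRs that ended up merged.
acceptance = (
    prs.assign(merged=prs["state"].eq("merged"))
       .groupby("agent")["merged"]
       .mean()
       .rename("acceptance_rate")
)
print(acceptance)
```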

If this is right

  • Agents complete submissions faster than humans yet see lower acceptance rates, revealing a measurable trust gap.
  • Individual developers can increase their output rate dramatically when using agents, with some matching years of prior work in days.
  • Agent-generated changes register as structurally simpler on standard complexity metrics than comparable human changes.
  • The dataset can serve as a living, extensible base for new benchmarks and governance studies in SE 3.0 workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed speed-acceptance mismatch suggests a need for new review tools that surface agent-specific risk signals.
  • Repository maintainers could use AIDev-style traces to set policy thresholds for automated contributions (see the sketch after this list).
  • Comparison of agent performance across project sizes or languages becomes feasible for the first time with this scale of data.
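
A hedged sketch of what such a maintainer policy gate might look like. The agent logins, the threshold, and the PR fields are invented for illustration and are not part of the AIDev release.

```python
# Hypothetical maintainer policy gate informed by AIDev-style traces.
# The agent logins, threshold, and PR fields are illustrative assumptions.
KNOWN_AGENT_LOGINS = {"devin-ai-integration[bot]", "copilot-swe-agent[bot]"}
MAX_CHANGED_FILES_WITHOUT_HUMAN_REVIEW = 10

def requires_human_review(pr: dict) -> bool:
    """Require a human reviewer for large agent-authored PRs."""
    is_agent = pr["author_login"] in KNOWN_AGENT_LOGINS
    return is_agent and pr["changed_files"] > MAX_CHANGED_FILES_WITHOUT_HUMAN_REVIEW

print(requires_human_review({"author_login": "copilot-swe-agent[bot]", "changed_files": 23}))  # True
```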

Load-bearing premise

The pull requests from the five selected agents and the repositories that expose their activity accurately represent typical in-the-wild agent behavior without major selection bias.

What would settle it

A follow-up collection that samples a wider set of agents or repositories and finds substantially different acceptance rates or code-complexity distributions would indicate that AIDev patterns do not generalize.
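
A minimal sketch of how that comparison could be run, using the Mann-Whitney U test (a nonparametric test that also appears in the paper's reference list). The complexity values here are synthetic placeholders, not AIDev data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Synthetic placeholders for code-complexity scores; a real check would compare
# AIDev against a freshly collected, broader sample of agent PRs.
aidev_sample = rng.lognormal(mean=1.0, sigma=0.5, size=500)
wider_sample = rng.lognormal(mean=1.2, sigma=0.5, size=500)

stat, p = mannwhitneyu(aidev_sample, wider_sample, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.0f}, p = {p:.3g}")
# A small p-value together with a meaningful effect size would indicate that
# the AIDev complexity pattern does not carry over to the wider sample.
```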

read the original abstract

The future of software engineering--SE 3.0--is unfolding with the rise of AI teammates: autonomous, goal-driven systems collaborating with human developers. Among these, autonomous coding agents are especially transformative, now actively initiating, reviewing, and evolving code at scale. This paper introduces AIDev, the first large-scale dataset capturing how such agents operate in the wild. Spanning over 456,000 pull requests by five leading agents--OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code--across 61,000 repositories and 47,000 developers, AIDev provides an unprecedented empirical foundation for studying autonomous teammates in software development. Unlike prior work that has largely theorized the rise of AI-native software engineering, AIDev offers structured, open data to support research in benchmarking, agent readiness, optimization, collaboration modeling, and AI governance. The dataset includes rich metadata on PRs, authorship, review timelines, code changes, and integration outcomes--enabling exploration beyond synthetic benchmarks like SWE-bench. For instance, although agents often outperform humans in speed, their PRs are accepted less frequently, revealing a trust and utility gap. Furthermore, while agents accelerate code submission--one developer submitted as many PRs in three days as they had in three years--these are structurally simpler (via code complexity metrics). We envision AIDev as a living resource: extensible, analyzable, and ready for the SE and AI communities. Grounding SE 3.0 in real-world evidence, AIDev enables a new generation of research into AI-native workflows and supports building the next wave of symbiotic human-AI collaboration. The dataset is publicly available at https://github.com/SAILResearch/AI_Teammates_in_SE3.

Keywords: AI Agent, Agentic AI, Coding Agent, Agentic Coding, Software Engineering Agent

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AIDev, a dataset of 456,000 pull requests generated by five autonomous coding agents (OpenAI Codex, Devin, GitHub Copilot, Cursor, Claude Code) across 61,000 repositories and 47,000 developers. It positions the release as the first large-scale empirical resource for studying AI teammates in SE 3.0, enabling research on benchmarking, collaboration modeling, and governance, while reporting two headline observations: agents submit PRs faster than humans but with lower acceptance rates, and the changes are structurally simpler by code-complexity metrics.

Significance. If the collection process is fully documented and selection effects are quantified, AIDev would constitute a valuable public resource that moves the field beyond synthetic benchmarks such as SWE-bench. The scale and open availability could support reproducible studies of agent readiness, optimization, and human-AI workflow modeling.

major comments (2)
  1. [Data Collection / Methods] Data-collection section: the manuscript supplies no description of how PRs were identified as originating from the five named agents, what attribution heuristics or repository filters were applied, or any validation steps against misattribution. Because the central claim is that the 456k PRs furnish a representative empirical foundation, the absence of these details leaves the representativeness assumption unverified and load-bearing for all downstream uses.
  2. [Results / Empirical Observations] Results / Observations paragraph: the statements that agents are faster yet less frequently accepted and produce simpler changes are presented without accompanying quantitative metrics (e.g., median time-to-merge, acceptance-rate deltas with confidence intervals, or complexity-measure definitions), statistical controls, or comparison baselines. These observations are used to illustrate the dataset’s utility, yet cannot be evaluated for robustness.
minor comments (2)
  1. [Abstract / Introduction] The abstract and introduction repeatedly use the phrase “first large-scale dataset” without citing or contrasting prior public PR corpora that include agent-generated activity; a brief related-work paragraph would clarify novelty.
  2. [Dataset Description] The GitHub repository link is given but no summary of the exact schema, file formats, or example records is provided in the paper; readers cannot assess usability without downloading the data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important areas for improving the clarity and rigor of our manuscript. We agree that additional documentation is needed and will revise the paper accordingly. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [Data Collection / Methods] Data-collection section: the manuscript supplies no description of how PRs were identified as originating from the five named agents, what attribution heuristics or repository filters were applied, or any validation steps against misattribution. Because the central claim is that the 456k PRs furnish a representative empirical foundation, the absence of these details leaves the representativeness assumption unverified and load-bearing for all downstream uses.

    Authors: We agree that the manuscript currently provides insufficient detail on the data collection and attribution process. The full pipeline—including heuristics based on commit author metadata, PR titles/descriptions, repository tags, and cross-referencing with known agent activity patterns—is implemented and documented in the public GitHub repository (https://github.com/SAILResearch/AI_Teammates_in_SE3). In the revised manuscript we will add a dedicated subsection to the Methods section that explicitly describes the identification heuristics, applied repository and PR filters, and validation steps (including manual sampling of 500 PRs and error-rate estimates). This addition will directly address concerns about representativeness and allow readers to assess potential selection effects (an illustrative sketch of one such heuristic appears after these responses). revision: yes

  2. Referee: [Results / Empirical Observations] Results / Observations paragraph: the statements that agents are faster yet less frequently accepted and produce simpler changes are presented without accompanying quantitative metrics (e.g., median time-to-merge, acceptance-rate deltas with confidence intervals, or complexity-measure definitions), statistical controls, or comparison baselines. These observations are used to illustrate the dataset’s utility, yet cannot be evaluated for robustness.

    Authors: The referee is correct that the headline observations are stated at a high level without supporting statistics in the current text. These claims are based on analyses performed on the released AIDev dataset, but the manuscript does not report the underlying numbers or controls. In the revision we will introduce a new “Preliminary Empirical Observations” subsection that supplies concrete metrics: median time-to-merge (with interquartile ranges), acceptance rates with 95% confidence intervals and deltas relative to human PRs in the same repositories, explicit definitions of the complexity metrics used (e.g., cyclomatic complexity, change size in LOC), and basic statistical comparisons. Where data permit, we will also note controls for repository and developer characteristics. This will make the illustrative claims evaluable and strengthen the demonstration of the dataset’s utility (a sketch of these summary statistics appears after these responses). revision: yes
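
On response 1: a minimal sketch of the kind of attribution heuristic the rebuttal describes, combining author metadata with PR-body markers. The bot logins and phrases below are illustrative assumptions, not the documented AIDev pipeline.

```python
import re
from typing import Optional

# Illustrative attribution heuristic; logins and phrases are assumptions,
# not the documented AIDev collection pipeline.
AGENT_LOGINS = {
    "devin-ai-integration[bot]": "Devin",
    "copilot-swe-agent[bot]": "GitHub Copilot",
}
AGENT_BODY_PATTERNS = {
    "Devin": re.compile(r"written by devin", re.IGNORECASE),
    "OpenAI Codex": re.compile(r"generated (with|by) codex", re.IGNORECASE),
}

def attribute_agent(author_login: str, pr_body: str) -> Optional[str]:
    """Return the suspected agent for a PR, or None if it looks human-authored."""
    if author_login in AGENT_LOGINS:
        return AGENT_LOGINS[author_login]
    for agent, pattern in AGENT_BODY_PATTERNS.items():
        if pattern.search(pr_body or ""):
            return agent
    return None

print(attribute_agent("copilot-swe-agent[bot]", ""))        # GitHub Copilot
print(attribute_agent("alice", "Refactor parser module"))   # None
```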
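
On response 2: a minimal sketch of the promised summary statistics (median time-to-merge with interquartile range, acceptance rate with a 95% confidence interval), computed on placeholder numbers rather than AIDev data.

```python
import math
import numpy as np

# Placeholder numbers standing in for AIDev-derived measurements.
hours_to_merge = np.array([1.5, 3.2, 6.4, 11.0, 48.0])  # hypothetical agent PRs
merged, total = 310, 1000                                # hypothetical acceptance counts

median = np.median(hours_to_merge)
q1, q3 = np.percentile(hours_to_merge, [25, 75])

# Normal-approximation 95% confidence interval for the acceptance rate.
p_hat = merged / total
se = math.sqrt(p_hat * (1 - p_hat) / total)
ci_low, ci_high = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"median time-to-merge = {median:.1f} h (IQR {q1:.1f}-{q3:.1f} h)")
print(f"acceptance rate = {p_hat:.1%} (95% CI {ci_low:.1%}-{ci_high:.1%})")
```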

Circularity Check

0 steps flagged

No circularity: data release paper with no derivations or fitted predictions

full rationale

The paper introduces the AIDev dataset of 456k PRs from five agents across 61k repositories and provides descriptive observations (e.g., faster submission but lower acceptance rates, structurally simpler changes). No equations, models, or predictions are derived; the contribution is the open data release itself. No self-citations, ansatzes, or uniqueness claims are used to justify core results, and no step reduces by construction to fitted inputs or prior self-referential work. The analysis chain is self-contained as an empirical resource without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the existence and utility of the released dataset rather than any mathematical model; therefore the ledger contains no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5668 in / 1102 out tokens · 52434 ms · 2026-05-15T17:33:33.499204+00:00 · methodology


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks

    cs.SE 2026-04 conditional novelty 8.0

    The two main benchmarks for LLM instructed code editing over-represent Python, miss common real-world domains and edit types, and have test coverage issues that limit what they measure.

  2. To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study

    cs.SE 2026-05 unverdicted novelty 7.0

    AI-generated code requires less maintenance than human code, with humans handling the majority of changes that are mostly feature extensions rather than bug fixes.

  3. Do AI Coding Agents Log Like Humans? An Empirical Study

    cs.SE 2026-04 unverdicted novelty 7.0

    AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans perfo...

  4. AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub

    cs.SE 2026-04 accept novelty 7.0

    AgenticFlict is a public dataset of 29K+ textual merge conflicts from AI agent PRs, collected via merge simulation on 107K processed PRs and showing a 27.67% conflict rate with variation across agents.

  5. A Large-Scale Empirical Study of AI-Generated Code in Real-World Repositories

    cs.SE 2026-03 unverdicted novelty 7.0

    A large-scale study of real-world repositories finds that AI-generated code differs from human-written code in complexity, structural traits, defect indicators, and commit-level activity patterns.

  6. Mining Type Constructs Using Patterns in AI-Generated Code

    cs.SE 2026-02 unverdicted novelty 7.0

    AI-generated TypeScript code uses the 'any' type 9x more often than human code and employs more advanced type constructs that can ignore checks, but agentic PRs have 1.8x higher acceptance rates.

  7. To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study

    cs.SE 2026-05 unverdicted novelty 6.0

    AI-generated code requires less maintenance than human-written code, mostly involving feature additions by humans rather than bug fixes.

  8. Hot Fixing in the Wild

    cs.SE 2026-04 unverdicted novelty 6.0

    Hot fixes show urgency patterns with reduced collaboration and testing, differing from regular fixes, and human versus AI agents display over 10 distinct repair behaviors in large-scale GitHub data.

  9. On the Footprints of Reviewer Bots Feedback on Agentic Pull Requests in OSS GitHub Repositories

    cs.SE 2026-04 unverdicted novelty 6.0

    Reviewer bots' higher comment volume on AI agent PRs is associated with slower resolutions and poorer average feedback quality, while feedback quality itself has no association with PR outcomes.

  10. Insights into Security-Related AI-Generated Pull Requests

    cs.SE 2026-04 unverdicted novelty 6.0

    AI-generated security pull requests frequently contain a small set of recurring weaknesses, with many flawed ones merged and rejections driven by process factors rather than technical issues.

  11. ORBIT: Guided Agentic Orchestration for Autonomous C-to-Rust Transpilation

    cs.SE 2026-04 unverdicted novelty 6.0

    ORBIT achieves 100% compilation success and 91.7% test success on 24 mostly large programs from CRUST-Bench by using dependency-aware orchestration and iterative verification, outperforming prior static and baseline tools.

  12. Agentic Business Process Management: A Research Manifesto

    cs.AI 2026-03 unverdicted novelty 6.0

    Agentic Business Process Management reframes BPM around autonomous agents that must exhibit framed autonomy, explainability, conversational actionability, and self-modification to keep their actions aligned with organ...

  13. Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes

    cs.SE 2026-03 unverdicted novelty 6.0

    An empirical study of real-world issues yields a taxonomy of 34 fault types, symptoms, and root causes in agentic AI systems, validated by 145 practitioners.

  14. CoT-Guard: Small Models for Strong Monitoring

    cs.CR 2026-05 unverdicted novelty 5.0

    CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.

  15. These Aren't the Reviews You're Looking For: How Humans Review AI-Generated Pull Requests

    cs.SE 2026-05 unverdicted novelty 5.0

    AI-generated PRs on GitHub receive fewer human reviews and more AI-mediated interactions than human-authored PRs.

  16. KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant

    cs.SE 2026-04 unverdicted novelty 5.0

    KISS Sorcar introduces a simple layered agent framework and VS Code IDE that reaches 62.2% pass rate on Terminal Bench 2.0 by combining ReAct execution, summarization-based continuation, parallel tools, persistent his...

  17. Scaling Human-AI Coding Collaboration Requires a Governable Consensus Layer

    cs.SE 2026-04 unverdicted novelty 5.0

    Agentic Consensus replaces code as the main artifact with a typed property graph world model that maintains commitments and evidence through synchronization operators, shifting evaluation to alignment fidelity and con...

  18. Beyond Functional Correctness: Design Issues in AI IDE-Generated Large-Scale Projects

    cs.SE 2026-04 conditional novelty 5.0

    AI IDEs with structured guidance can produce functional large-scale code but frequently introduce design flaws such as duplication, complexity, and principle violations that risk long-term maintainability.

  19. From Industry Claims to Empirical Reality: An Empirical Study of Code Review Agents in Pull Requests

    cs.SE 2026-04 conditional novelty 5.0

    Code review agents achieve 45.20% merge rate on PRs versus 68.37% for humans, with 60.2% of agent-only closed PRs showing 0-30% signal quality.

  20. Beyond the 'Diff': Addressing Agentic Entropy in Agentic Software Development

    cs.SE 2026-03 unverdicted novelty 5.0

    Agentic entropy names the systemic drift in AI coding agents away from architectural intent; a new framework using conformity seeding, reasoning monitoring, and causal graph interfaces supplies process-level oversight...

  21. Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review

    cs.SE 2026-04 unverdicted novelty 2.0

    A rapid review of fairness in LLM-enabled multi-agent systems for the software development lifecycle concludes that the field lacks standardized evaluations, broad coverage, and effective governance, leaving it unprep...

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 20 Pith papers · 3 internal anchors

  1. [1]

    [n. d.]. Introducing Codex. https://openai.com/index/introducing-codex/. [Accessed 07-07-2025]

  2. [2]

    [n. d.]. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/. [Accessed 17-07-2025]

  3. [3]

    Toufique Ahmed, Premkumar Devanbu, Christoph Treude, and Michael Pradel. 2025. Can LLMs Replace Manual Annotation of Software Engineering Artifacts?. In 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). 526–538. doi:10.1109/MSR66628.2025.00086

  4. [4]

    Leonhard Applis, Yuntong Zhang, Shanchao Liang, Nan Jiang, Lin Tan, and Abhik Roychoudhury. 2025. Unified Software Engineering agent as AI Software Engineer. arXiv:2506.14683 [cs.SE] https://arxiv.org/abs/2506.14683

  5. [5]

    Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2025. To Code or Not To Code? Exploring Impact of Code in Pre-training. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=zSfeN1uAcx

  6. [6]

    Aaditya Bhatia, Gustavo A. Oliva, Gopi Krishnan Rajbahadur, Haoxiang Zhang, Yihao Chen, Zhilong Chen, Arthur Leung, Dayi Lin, Boyuan Chen, and Ahmed E. Hassan. 2025. SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation. (Jul 2025). arXiv:2507.09108 [cs.SE] doi:10.48550/arXiv.2507.09108

  7. [7]

    Islem Bouzenia, Prem Devanbu, and Michael Pradel. 2024. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE) (2024), 2188–2200

  8. [8]

    Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2025. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 2188–2200. doi:10.1109/ICSE55347.2025.00157

  9. [9]

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. arXiv:1606.01540 [cs.LG] https://arxiv.org/abs/1606.01540

  10. [10]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv:2107.03374 [cs.SE] https://arxiv.org/abs/2107.03374

  11. [11]

    Runxiang Cheng, Michele Tufano, Jürgen Cito, José Cambronero, Pat Rondon, Renyao Wei, Aaron Sun, and Satish Chandra. 2025. Agentic Bug Reproduction for Effective Automated Program Repair at Google. (2025). arXiv:2502.01821 [cs.SE] https://arxiv.org/abs/2502.01821

  12. [12]

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132 [cs.AI] https://arxiv.org/abs/2403.04132

  13. [13]

    Umut Cihan, Vahid Haratian, Arda İçöz, Mert Kaan Gül, Ömercan Devran, Emircan Furkan Bayendur, Baykal Mehmet Uçar, and Eray Tüzün. 2024. Automated Code Review In Practice. arXiv:2412.18531 [cs.SE] https://arxiv.org/abs/2412.18531

  14. [14]

    Dekun Dai, MingWei Liu, Anji Li, Jialun Cao, Yanlin Wang, et al. 2025. FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks. arXiv preprint arXiv:2504.06939 (2025)

  15. [15]

    Giuseppe Desolda, Andrea Esposito, Francesco Greco, Cesare Tucci, Paolo Buono, and Antonio Piccinno. 2025. Understanding User Mental Models in AI-Driven Code Completion Tools: Insights from an Elicitation Study. arXiv:2502.02194 [cs.HC] https://arxiv.org/abs/2502.02194

  16. [16]

    Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2023. Self-Collaboration Code Generation via ChatGPT. ACM Transactions on Software Engineering and Methodology 33 (2023), 1–38

  17. [17]

    Nicole Forsgren, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler. 2021. The SPACE of Developer Productivity: There’s more to it than you think. Queue 19, 1 (March 2021), 20–48. doi:10.1145/3454122.3454124

  18. [18]

    Ahmed E. Hassan, Dayi Lin, Gopi Krishnan Rajbahadur, Keheliya Gallaba, Filipe Roseiro Cogo, Boyuan Chen, Haoxiang Zhang, Kishanthan Thangarajah, Gustavo Oliva, Jiahuei (Justina) Lin, et al. 2024. Rethinking Software Engineering in the Era of Foundation Models: A Curated Catalogue of Challenges in the Development of Trustworthy FMware. In Companion Procee...

  19. [19]

    Ahmed E. Hassan, Gustavo A. Oliva, Dayi Lin, Boyuan Chen, and Zhen Ming Jiang. 2024. Towards AI-Native Software Engineering (SE 3.0): A Vision and a Challenge Roadmap. arXiv:2410.06107 [cs.SE] https://arxiv.org/abs/2410.06107

  20. [20]

    Junda He, Christoph Treude, and David Lo. 2025. LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead. ACM Trans. Softw. Eng. Methodol. 34, 5, Article 124 (May 2025), 30 pages. doi:10.1145/3712003

  21. [21]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues?. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=VTF8yNQM66

  22. [22]

    Shanchao Liang, Spandan Garg, and Roshanak Zilouchian Moghaddam. 2025. The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason. arXiv:2506.12286 [cs.AI] https://arxiv.org/abs/2506.12286

  23. [23]

    Shanchao Liang, Yiran Hu, Nan Jiang, and Lin Tan. 2024. Can Language Models Replace Programmers? REPOCOD Says “Not Yet”. arXiv preprint arXiv:2410.21647 (2024)

  24. [24]

    Feng Lin, Dong Jae Kim, and Tse-Hsun Chen. 2024. SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents. 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE) (2024), 1527–1539.

  25. [25]

    Jeffrey D. Long, Du Feng, and Norman Cliff. 2003. Ordinal Analysis of Behavioral Data. In Handbook of Psychology, Irving B. Weiner (Ed.). John Wiley & Sons, Inc., Hoboken, NJ, USA, Chapter 25, 635–661. doi:10.1002/0471264385.wei0225

  26. [26]

    H. B. Mann and D. R. Whitney. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Annals of Mathematical Statistics 18 (1947), 50–60

  27. [27]

    Dimitrios-Nikitas Nastos, Themistoklis Diamantopoulos, Davide Tosi, Martina Tropeano, and Andreas L. Symeonidis. 2025. Towards an Interpretable Analysis for Estimating the Resolution Time of Software Issues. arXiv:2505.01108 [cs.SE] https://arxiv.org/abs/2505.01108

  28. [28]

    Ketai Qiu, Niccolò Puccinelli, Matteo Ciniselli, and Luca Di Grazia. 2025. From Today’s Code to Tomorrow’s Symphony: The AI Transformation of Developer’s Routine by 2030. ACM Trans. Softw. Eng. Methodol. 34, 5, Article 121 (May 2025), 17 pages. doi:10.1145/3709353

  29. [29]

    Peter C. Rigby, Seth Rogers, Sadruddin Saleem, Parth Suresh, Daniel Suskin, Patrick Riggs, Chandra Maddila, Nachiappan Nagappan, and Audris Mockus. 2025. Improving Code Reviewer Recommendation: Accuracy, Latency, Workload, and Bystanders. ACM Trans. Softw. Eng. Methodol. (May 2025). doi:10.1145/3736405. Just Accepted

  30. [30]

    Jeanine Romano, Jeffrey D Kromrey, Jesse Coraggio, Jeff Skowronek, and Linda Devine. 2006. Exploring methods for evaluating group differences on the NSSE and other surveys: Are the t-test and Cohen’s d indices the most appropriate choices. In Annual Meeting of the Southern Association for Institutional Research. Citeseer, 1–51

  31. [31]

    Abhik Roychoudhury, Corina Pasareanu, Michael Pradel, and Baishakhi Ray. 2025. Agentic AI Software Engineers: Programming with Trust. arXiv:2502.13767 [cs.SE] https://arxiv.org/abs/2502.13767

  32. [32]

    David Silver and Richard S Sutton. 2025. Welcome to the era of experience. Google AI 1 (2025)

  33. [33]

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2025. Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=4FWAwZtd2n

  34. [34]

    Jimenez, Alex L

    John Yang, Carlos E. Jimenez, Alex L. Zhang, et al. 2024. SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? arXiv preprint arXiv:2410.03859 (2024)

  35. [35]

    Shaoyi Yang. 2025. Chain of Draft for Software Engineering: Challenges in Applying Concise Reasoning to Code Tasks. arXiv:2506.10987 [cs.SE] https://arxiv.org/abs/2506.10987

  36. [36]

    Xin Zhou, Martin Weyssow, Ratnadira Widyasari, Ting Zhang, Junda He, Yunbo Lyu, Jianming Chang, Beiqi Zhang, Dan Huang, and David Lo

  37. [37]

    LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks. arXiv:2502.06215 [cs.SE] https://arxiv.org/abs/2502.06215