pith. machine review for the scientific record.

arxiv: 2605.08017 · v2 · submitted 2026-05-08 · 💻 cs.SE

Recognition: no theorem link

Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:17 UTC · model grok-4.3

classification 💻 cs.SE
keywords AI coding agents · pull requests · collaborator tools · assistant tools · workflow governance · software development automation · PR lifecycle analysis · human-AI collaboration

The pith

AI coding tools sit on a spectrum: some initiate and drive PRs while others only assist, but humans keep final merge authority in both cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper classifies five AI coding tools along a Collaborator-Assistant spectrum according to who starts the work and who authorizes completion in pull request workflows. Collaborator tools concentrate initiative in the agent, which opens branches and carries the changes forward, while humans handle review and endorsement. Assistant tools keep task direction with the human and supply only bounded support inside human-led flows. Analysis of 29,585 reconstructed PR lifecycles shows that operational agency and merge governance have decoupled: agent-initiated PRs reach 96 percent or higher for collaborator tools, yet almost all terminal merge decisions stay with humans. The work supplies an Initiator x Approver taxonomy, per-tool state machines, and a replication package to study these patterns.

Core claim

We characterize tools along a Collaborator-Assistant spectrum in how they redistribute initiative, oversight, and endorsement, while merge governance remains predominantly human across five tools (OpenAI, Copilot, Devin, Cursor, Claude Code). Collaborator tools (Cursor, Devin, Copilot) concentrate operational initiative in agents that open and carry PR work forward, with humans retaining review and endorsement on the path to merge; Assistant tools (OpenAI, Claude) leave task direction primarily with humans and supply bounded support within human-led workflows. Across the spectrum, agency and governance decouple: Collaborator workflows are >=96% agent initiated, yet terminal merge authority remains almost exclusively human, with agent-classified approvers confined to a small fraction of PRs.

What carries the argument

An Initiator x Approver taxonomy with six interaction scenarios, applied to each reconstructed PR lifecycle to assign who starts the work and who authorizes its completion.
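The taxonomy's role assignment can be sketched as a small classifier over ordered PR event logs. This is an editorial illustration, not the paper's replication-package code; the event shape, the `open_pr`/`merge_pr` action names, and the actor labels are all hypothetical.

```python
# Hypothetical sketch of an Initiator x Approver classification step.
# Event and actor names are illustrative, not the paper's schema.

def classify_pr(events):
    """Assign (initiator, approver) roles from an ordered PR event log.

    Each event is a dict like {"action": "open_pr", "actor": "agent"}.
    Returns an (initiator, approver) pair; a role that never appears
    (e.g. the approver of an unmerged PR) stays None.
    """
    initiator = approver = None
    for ev in events:
        if ev["action"] == "open_pr" and initiator is None:
            initiator = ev["actor"]   # who starts the work
        if ev["action"] == "merge_pr":
            # Caveat from the paper: a merge event records the
            # executor, not necessarily the decision-maker.
            approver = ev["actor"]    # who authorizes completion
    return initiator, approver

# An agent-opened, human-merged PR: the common Collaborator pattern.
lifecycle = [
    {"action": "open_pr", "actor": "agent"},
    {"action": "review", "actor": "human"},
    {"action": "merge_pr", "actor": "human"},
]
print(classify_pr(lifecycle))  # ('agent', 'human')
```

Crossing the two roles over {agent, human, none} is what yields the small set of interaction scenarios the paper tabulates per tool.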

If this is right

  • Collaborator workflows are 96 percent or more agent initiated.
  • Terminal merge authority remains almost exclusively human.
  • Agent-classified approvers appear in only a small fraction of PRs.
  • When automation executes a merge, logs record the executor but not the decision-maker.
  • The taxonomy, per-tool state machines, and replication package enable further study of automation and oversight in PR workflows.
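The per-tool state machines report phase transition probabilities; assuming each reconstructed lifecycle is an ordered sequence of phase labels, those probabilities reduce to normalized bigram counts. A minimal sketch (the phase names are illustrative, not the paper's exact states):

```python
from collections import Counter, defaultdict

def transition_probabilities(lifecycles):
    """Estimate P(next phase | current phase) from phase sequences.

    `lifecycles` is a list of phase-label sequences such as
    ["opened", "reviewed", "merged"]. Phase names are illustrative.
    """
    counts = defaultdict(Counter)
    for phases in lifecycles:
        # Count each adjacent (current, next) phase pair.
        for cur, nxt in zip(phases, phases[1:]):
            counts[cur][nxt] += 1
    # Normalize each row of counts into a probability distribution.
    return {
        cur: {nxt: n / sum(nxts.values()) for nxt, n in nxts.items()}
        for cur, nxts in counts.items()
    }

lifecycles = [
    ["opened", "reviewed", "merged"],
    ["opened", "reviewed", "closed"],
    ["opened", "merged"],
]
probs = transition_probabilities(lifecycles)
print(probs["opened"])  # roughly {'reviewed': 0.67, 'merged': 0.33}
```

Run once per tool over its reconstructed lifecycles, this produces exactly the kind of per-tool transition diagram the figures show.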

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams adopting collaborator tools may need review processes tuned to high-volume agent output rather than initiation control.
  • The observed log boundary for decision-making suggests a practical need for explicit human-approval markers before any automated merge step.
  • Patterns found in open repositories could be tested in closed corporate codebases to check whether the same initiator-approver split holds.
  • The decoupling of agency from governance may generalize to other automation domains where execution is delegated but final sign-off stays human.
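One way to operationalize the explicit human-approval marker (an editorial sketch, not a mechanism from the paper): before any automated merge, require at least one approving review from a non-bot account. The function names and bot-account list are hypothetical; the endpoint `GET /repos/{owner}/{repo}/pulls/{number}/reviews` is GitHub's real REST API.

```python
import json
import urllib.request

def human_approved(reviews, bot_accounts):
    """True if at least one APPROVED review comes from a non-bot account.

    `reviews` follows the shape of GitHub's
    GET /repos/{owner}/{repo}/pulls/{number}/reviews response.
    """
    return any(
        r["state"] == "APPROVED" and r["user"]["login"] not in bot_accounts
        for r in reviews
    )

def fetch_reviews(owner, repo, number, token=None):
    """Fetch the review list for one PR (live network call)."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}/reviews",
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# A bot's approval alone should not count as human endorsement.
reviews = [
    {"state": "APPROVED", "user": {"login": "example-agent[bot]"}},
    {"state": "COMMENTED", "user": {"login": "maintainer"}},
]
print(human_approved(reviews, {"example-agent[bot]"}))  # False
```

A branch-protection rule requiring reviews serves the same purpose natively; the point of an explicit check like this is to make the human decision auditable in the logs, which is exactly the observational boundary the paper flags.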

Load-bearing premise

The Initiator x Approver taxonomy and six interaction scenarios accurately capture the real division of labor without significant misclassification from incomplete logs or tool-specific behaviors.

What would settle it

A large sample of PRs in which non-human accounts execute the final merge decision without recorded human endorsement, or in which the taxonomy assigns roles that contradict direct inspection of commit and review logs.

Figures

Figures reproduced from arXiv: 2605.08017 by Safwat Hassan, Young Jo (seph) Chung.

Figure 1. High-level overview of the data pipeline.
Figure 2. PR workflow lifecycle: phases and terminal out
Figure 3. Interaction scenario distribution by tool (29,585 included PRs). Stacked bars show percentage per tool in S1–S6. Copilot
Figure 5. Cursor PR workflow: phase transition probabilities
Figure 7. OpenAI PR workflow: phase transition probabilities
Figure 8. Claude PR workflow: phase transition probabilities
Original abstract

When AI coding agents open branches and submit pull requests (PRs), two questions co-determine oversight design: who starts the work (operational agency) and who authorizes its completion (merge governance). We characterize tools along a Collaborator-Assistant spectrum in how they redistribute initiative, oversight, and endorsement, while merge governance remains predominantly human across five tools (OpenAI, Copilot, Devin, Cursor, Claude Code). We analyze 29,585 PR lifecycles using an Initiator x Approver taxonomy with six interaction scenarios; lifecycle reconstruction supplies the how behind those roles. Collaborator tools (Cursor, Devin, Copilot) concentrate operational initiative in agents that open and carry PR work forward, with humans retaining review and endorsement on the path to merge; Assistant tools (OpenAI, Claude) leave task direction primarily with humans and supply bounded support within human-led workflows. Across the spectrum, agency and governance decouple: Collaborator workflows are >=96% agent initiated, yet terminal merge authority remains almost exclusively human, with agent-classified approvers confined to a small fraction of PRs. Where automation executes a merge, logs record the executor but not the decision-maker, marking a boundary of observation. We contribute the taxonomy, per-tool state machines, and a replication package for research on automation, oversight, and governance in PR workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes 29,585 PR lifecycles across five AI coding tools (OpenAI, Copilot, Devin, Cursor, Claude Code) using an Initiator x Approver taxonomy with six interaction scenarios. It positions the tools on a Collaborator-Assistant spectrum, reporting that Collaborator tools (Cursor, Devin, Copilot) show agents initiating and advancing >=96% of PRs while humans retain review and merge endorsement, whereas Assistant tools (OpenAI, Claude) keep task direction with humans and supply bounded support. Merge governance remains almost exclusively human across tools, with agency and governance decoupling as a key observation; the work contributes the taxonomy, per-tool state machines, and a replication package.

Significance. If the taxonomy classifications hold, the study supplies a large-scale empirical basis for understanding how AI coding agents redistribute operational initiative versus oversight in real PR workflows. The dataset size and replication package are clear strengths that enable follow-on research on automation, governance, and tool design in software engineering.

major comments (2)
  1. [§3] §3 (Taxonomy and lifecycle reconstruction): The six-scenario Initiator x Approver taxonomy is load-bearing for the Collaborator-Assistant spectrum and the >=96% agent-initiation claim, yet the reconstruction from commit authorship, branch creation, and merge logs includes no validation set, inter-rater check against full PR threads, or sensitivity analysis for cases where human prompts are omitted from logs or where automated merges record only the executor.
  2. [§4.2] §4.2 (Per-tool results): The differential logging fidelity across Cursor, Devin, Copilot, OpenAI, and Claude is acknowledged as a boundary but not quantified; without explicit handling or exclusion criteria for edge cases in state reconstruction, the reported separation between Collaborator (>=96% agent-initiated) and Assistant categories risks systematic bias.
minor comments (2)
  1. [Abstract] Abstract: The per-tool PR counts are not stated, which would help readers evaluate the balance underlying the spectrum claims.
  2. [Figures] Figure captions: Ensure state-machine diagrams explicitly map each transition to one of the six scenarios for clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. The feedback highlights key methodological considerations for our taxonomy and results. We address each major comment below and have revised the manuscript to incorporate additional analyses where feasible.

Point-by-point responses
  1. Referee: [§3] §3 (Taxonomy and lifecycle reconstruction): The six-scenario Initiator x Approver taxonomy is load-bearing for the Collaborator-Assistant spectrum and the >=96% agent-initiation claim, yet the reconstruction from commit authorship, branch creation, and merge logs includes no validation set, inter-rater check against full PR threads, or sensitivity analysis for cases where human prompts are omitted from logs or where automated merges record only the executor.

    Authors: We acknowledge that our reconstruction method, based on commit authorship, branch creation, and merge logs, does not include a held-out validation set or inter-rater reliability assessment against full PR discussion threads. This limitation stems from the scale of the 29,585 PR dataset and the log-centric data sources, which do not uniformly capture complete conversational histories. To strengthen the work, we have added a sensitivity analysis in the revised Section 3 that systematically varies assumptions regarding omitted human prompts and executor-only merge records. The analysis demonstrates that the core Collaborator-Assistant classifications and the >=96% agent-initiation rates remain stable under these perturbations. We have also expanded the explicit discussion of observational boundaries in the taxonomy description. revision: partial

  2. Referee: [§4.2] §4.2 (Per-tool results): The differential logging fidelity across Cursor, Devin, Copilot, OpenAI, and Claude is acknowledged as a boundary but not quantified; without explicit handling or exclusion criteria for edge cases in state reconstruction, the reported separation between Collaborator (>=96% agent-initiated) and Assistant categories risks systematic bias.

    Authors: We agree that quantifying differential logging fidelity and providing explicit handling for edge cases would reduce potential bias concerns. In the revised manuscript, we have added a new quantitative assessment in Section 4.2 that reports per-tool estimates of logging completeness derived from available metadata fields. We now specify exclusion criteria for ambiguous reconstruction cases (e.g., PRs with incomplete branch or commit metadata) and present robustness results both including and excluding these cases. The separation between Collaborator tools (>=96% agent-initiated) and Assistant tools is preserved in both analyses, supporting the reported spectrum while transparently documenting the boundary conditions. revision: yes

standing simulated objections not resolved
  • Full inter-rater validation against complete PR discussion threads across the entire dataset, due to the log-based nature of the data sources which do not provide uniform access to conversational content for all 29,585 PRs.

Circularity Check

0 steps flagged

No circularity: direct empirical classification of observed PR events

full rationale

The paper defines an Initiator x Approver taxonomy with six scenarios as a contribution, then applies it to reconstruct 29,585 PR lifecycles from commit authorship, branch creation, and merge logs. Reported distributions (e.g., Collaborator tools >=96% agent-initiated with human merge authority) are direct outputs of this classification on the data; no equations, fitted parameters, predictions, or self-citations reduce any result to prior definitions by construction. The taxonomy and state machines are presented as new, and the analysis remains self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that PR logs allow faithful reconstruction of initiator and approver roles and that the five tools can be cleanly placed on the collaborator-assistant spectrum without hybrid cases dominating the data.

axioms (1)
  • domain assumption: PR lifecycle events in the studied platforms can be reliably mapped to initiator and approver roles from available metadata and logs.
    Required to classify 29,585 PRs into the six interaction scenarios.
invented entities (1)
  • Collaborator-Assistant spectrum (no independent evidence)
    purpose: Framework to classify how AI tools redistribute operational initiative versus human oversight.
    New classification axis introduced to organize observations across the five tools.

pith-pipeline@v0.9.0 · 5543 in / 1305 out tokens · 38901 ms · 2026-05-14T21:17:36.129937+00:00 · methodology

discussion (0)

