pith. machine review for the scientific record.

arxiv: 2605.08017 · v2 · submitted 2026-05-08 · 💻 cs.SE

Recognition: no theorem link

Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:17 UTC · model grok-4.3

classification 💻 cs.SE
keywords AI coding agents · pull requests · collaborator tools · assistant tools · workflow governance · software development automation · PR lifecycle analysis · human-AI collaboration

The pith

AI coding tools sit on a spectrum: some initiate and drive PRs while others only assist, but humans keep final merge authority in both cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper classifies five AI coding tools along a Collaborator-Assistant spectrum according to who starts the work and who authorizes completion in pull request workflows. Collaborator tools concentrate initiative in the agent, which opens branches and carries the changes forward, while humans handle review and endorsement. Assistant tools keep task direction with the human and supply only bounded support inside human-led flows. Analysis of 29,585 reconstructed PR lifecycles shows that operational agency and merge governance have decoupled: agent-initiated PRs reach 96 percent or higher for collaborator tools, yet almost all terminal merge decisions stay with humans. The work supplies an Initiator x Approver taxonomy, per-tool state machines, and a replication package to study these patterns.

Core claim

We characterize tools along a Collaborator-Assistant spectrum in how they redistribute initiative, oversight, and endorsement, while merge governance remains predominantly human across five tools (OpenAI, Copilot, Devin, Cursor, Claude Code). Collaborator tools (Cursor, Devin, Copilot) concentrate operational initiative in agents that open and carry PR work forward, with humans retaining review and endorsement on the path to merge; Assistant tools (OpenAI, Claude) leave task direction primarily with humans and supply bounded support within human-led workflows. Across the spectrum, agency and governance decouple: Collaborator workflows are >=96% agent initiated, yet terminal merge authority remains almost exclusively human, with agent-classified approvers confined to a small fraction of PRs.

What carries the argument

An Initiator x Approver taxonomy with six interaction scenarios, applied to each reconstructed PR lifecycle to assign who starts the work and who authorizes its completion.
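The taxonomy's role assignment can be sketched as a small classifier over ordered PR event logs. This is an editorial illustration, not the paper's replication-package code; the event shape, the `open_pr`/`merge_pr` action names, and the actor labels are all hypothetical.

```python
# Hypothetical sketch of an Initiator x Approver classification step.
# Event and actor names are illustrative, not the paper's schema.

def classify_pr(events):
    """Assign (initiator, approver) roles from an ordered PR event log.

    Each event is a dict like {"action": "open_pr", "actor": "agent"}.
    Returns an (initiator, approver) pair; a role that never appears
    (e.g. the approver of an unmerged PR) stays None.
    """
    initiator = approver = None
    for ev in events:
        if ev["action"] == "open_pr" and initiator is None:
            initiator = ev["actor"]   # who starts the work
        if ev["action"] == "merge_pr":
            # Caveat from the paper: a merge event records the
            # executor, not necessarily the decision-maker.
            approver = ev["actor"]    # who authorizes completion
    return initiator, approver

# An agent-opened, human-merged PR: the common Collaborator pattern.
lifecycle = [
    {"action": "open_pr", "actor": "agent"},
    {"action": "review", "actor": "human"},
    {"action": "merge_pr", "actor": "human"},
]
print(classify_pr(lifecycle))  # ('agent', 'human')
```

Crossing the two roles over {agent, human, none} is what yields the small set of interaction scenarios the paper tabulates per tool.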

If this is right

  • Collaborator workflows are 96 percent or more agent initiated.
  • Terminal merge authority remains almost exclusively human.
  • Agent-classified approvers appear in only a small fraction of PRs.
  • When automation executes a merge, logs record the executor but not the decision-maker.
  • The taxonomy, per-tool state machines, and replication package enable further study of automation and oversight in PR workflows.
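The per-tool state machines report phase transition probabilities; assuming each reconstructed lifecycle is an ordered sequence of phase labels, those probabilities reduce to normalized bigram counts. A minimal sketch (the phase names are illustrative, not the paper's exact states):

```python
from collections import Counter, defaultdict

def transition_probabilities(lifecycles):
    """Estimate P(next phase | current phase) from phase sequences.

    `lifecycles` is a list of phase-label sequences such as
    ["opened", "reviewed", "merged"]. Phase names are illustrative.
    """
    counts = defaultdict(Counter)
    for phases in lifecycles:
        # Count each adjacent (current, next) phase pair.
        for cur, nxt in zip(phases, phases[1:]):
            counts[cur][nxt] += 1
    # Normalize each row of counts into a probability distribution.
    return {
        cur: {nxt: n / sum(nxts.values()) for nxt, n in nxts.items()}
        for cur, nxts in counts.items()
    }

lifecycles = [
    ["opened", "reviewed", "merged"],
    ["opened", "reviewed", "closed"],
    ["opened", "merged"],
]
probs = transition_probabilities(lifecycles)
print(probs["opened"])  # roughly {'reviewed': 0.67, 'merged': 0.33}
```

Run once per tool over its reconstructed lifecycles, this produces exactly the kind of per-tool transition diagram the figures show.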

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams adopting collaborator tools may need review processes tuned to high-volume agent output rather than initiation control.
  • The observed log boundary for decision-making suggests a practical need for explicit human-approval markers before any automated merge step.
  • Patterns found in open repositories could be tested in closed corporate codebases to check whether the same initiator-approver split holds.
  • The decoupling of agency from governance may generalize to other automation domains where execution is delegated but final sign-off stays human.
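One way to operationalize the explicit human-approval marker (an editorial sketch, not a mechanism from the paper): before any automated merge, require at least one approving review from a non-bot account. The function names and bot-account list are hypothetical; the endpoint `GET /repos/{owner}/{repo}/pulls/{number}/reviews` is GitHub's real REST API.

```python
import json
import urllib.request

def human_approved(reviews, bot_accounts):
    """True if at least one APPROVED review comes from a non-bot account.

    `reviews` follows the shape of GitHub's
    GET /repos/{owner}/{repo}/pulls/{number}/reviews response.
    """
    return any(
        r["state"] == "APPROVED" and r["user"]["login"] not in bot_accounts
        for r in reviews
    )

def fetch_reviews(owner, repo, number, token=None):
    """Fetch the review list for one PR (live network call)."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}/reviews",
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# A bot's approval alone should not count as human endorsement.
reviews = [
    {"state": "APPROVED", "user": {"login": "example-agent[bot]"}},
    {"state": "COMMENTED", "user": {"login": "maintainer"}},
]
print(human_approved(reviews, {"example-agent[bot]"}))  # False
```

A branch-protection rule requiring reviews serves the same purpose natively; the point of an explicit check like this is to make the human decision auditable in the logs, which is exactly the observational boundary the paper flags.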

Load-bearing premise

The Initiator x Approver taxonomy and six interaction scenarios accurately capture the real division of labor without significant misclassification from incomplete logs or tool-specific behaviors.

What would settle it

A large sample of PRs in which non-human accounts execute the final merge decision without recorded human endorsement, or in which the taxonomy assigns roles that contradict direct inspection of commit and review logs.

Figures

Figures reproduced from arXiv: 2605.08017 by Safwat Hassan, Young Jo (seph) Chung.

Figure 1. High-level overview of the data pipeline.
Figure 2. PR workflow lifecycle: phases and terminal out
Figure 3. Interaction scenario distribution by tool (29,585 included PRs). Stacked bars show percentage per tool in S1–S6. Copilot
Figure 5. Cursor PR workflow: phase transition probabilities
Figure 7. OpenAI PR workflow: phase transition probabilities
Figure 8. Claude PR workflow: phase transition probabilities
Original abstract

When AI coding agents open branches and submit pull requests (PRs), two questions co-determine oversight design: who starts the work (operational agency) and who authorizes its completion (merge governance). We characterize tools along a Collaborator-Assistant spectrum in how they redistribute initiative, oversight, and endorsement, while merge governance remains predominantly human across five tools (OpenAI, Copilot, Devin, Cursor, Claude Code). We analyze 29,585 PR lifecycles using an Initiator x Approver taxonomy with six interaction scenarios; lifecycle reconstruction supplies the how behind those roles. Collaborator tools (Cursor, Devin, Copilot) concentrate operational initiative in agents that open and carry PR work forward, with humans retaining review and endorsement on the path to merge; Assistant tools (OpenAI, Claude) leave task direction primarily with humans and supply bounded support within human-led workflows. Across the spectrum, agency and governance decouple: Collaborator workflows are >=96% agent initiated, yet terminal merge authority remains almost exclusively human, with agent-classified approvers confined to a small fraction of PRs. Where automation executes a merge, logs record the executor but not the decision-maker, marking a boundary of observation. We contribute the taxonomy, per-tool state machines, and a replication package for research on automation, oversight, and governance in PR workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes 29,585 PR lifecycles across five AI coding tools (OpenAI, Copilot, Devin, Cursor, Claude Code) using an Initiator x Approver taxonomy with six interaction scenarios. It positions the tools on a Collaborator-Assistant spectrum, reporting that Collaborator tools (Cursor, Devin, Copilot) show agents initiating and advancing >=96% of PRs while humans retain review and merge endorsement, whereas Assistant tools (OpenAI, Claude) keep task direction with humans and supply bounded support. Merge governance remains almost exclusively human across tools, with agency and governance decoupling as a key observation; the work contributes the taxonomy, per-tool state machines, and a replication package.

Significance. If the taxonomy classifications hold, the study supplies a large-scale empirical basis for understanding how AI coding agents redistribute operational initiative versus oversight in real PR workflows. The dataset size and replication package are clear strengths that enable follow-on research on automation, governance, and tool design in software engineering.

major comments (2)
  1. [§3] §3 (Taxonomy and lifecycle reconstruction): The six-scenario Initiator x Approver taxonomy is load-bearing for the Collaborator-Assistant spectrum and the >=96% agent-initiation claim, yet the reconstruction from commit authorship, branch creation, and merge logs includes no validation set, inter-rater check against full PR threads, or sensitivity analysis for cases where human prompts are omitted from logs or where automated merges record only the executor.
  2. [§4.2] §4.2 (Per-tool results): The differential logging fidelity across Cursor, Devin, Copilot, OpenAI, and Claude is acknowledged as a boundary but not quantified; without explicit handling or exclusion criteria for edge cases in state reconstruction, the reported separation between Collaborator (>=96% agent-initiated) and Assistant categories risks systematic bias.
minor comments (2)
  1. [Abstract] Abstract: The per-tool PR counts are not stated, which would help readers evaluate the balance underlying the spectrum claims.
  2. [Figures] Figure captions: Ensure state-machine diagrams explicitly map each transition to one of the six scenarios for clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. The feedback highlights key methodological considerations for our taxonomy and results. We address each major comment below and have revised the manuscript to incorporate additional analyses where feasible.

Point-by-point responses
  1. Referee: [§3] §3 (Taxonomy and lifecycle reconstruction): The six-scenario Initiator x Approver taxonomy is load-bearing for the Collaborator-Assistant spectrum and the >=96% agent-initiation claim, yet the reconstruction from commit authorship, branch creation, and merge logs includes no validation set, inter-rater check against full PR threads, or sensitivity analysis for cases where human prompts are omitted from logs or where automated merges record only the executor.

    Authors: We acknowledge that our reconstruction method, based on commit authorship, branch creation, and merge logs, does not include a held-out validation set or inter-rater reliability assessment against full PR discussion threads. This limitation stems from the scale of the 29,585 PR dataset and the log-centric data sources, which do not uniformly capture complete conversational histories. To strengthen the work, we have added a sensitivity analysis in the revised Section 3 that systematically varies assumptions regarding omitted human prompts and executor-only merge records. The analysis demonstrates that the core Collaborator-Assistant classifications and the >=96% agent-initiation rates remain stable under these perturbations. We have also expanded the explicit discussion of observational boundaries in the taxonomy description. revision: partial

  2. Referee: [§4.2] §4.2 (Per-tool results): The differential logging fidelity across Cursor, Devin, Copilot, OpenAI, and Claude is acknowledged as a boundary but not quantified; without explicit handling or exclusion criteria for edge cases in state reconstruction, the reported separation between Collaborator (>=96% agent-initiated) and Assistant categories risks systematic bias.

    Authors: We agree that quantifying differential logging fidelity and providing explicit handling for edge cases would reduce potential bias concerns. In the revised manuscript, we have added a new quantitative assessment in Section 4.2 that reports per-tool estimates of logging completeness derived from available metadata fields. We now specify exclusion criteria for ambiguous reconstruction cases (e.g., PRs with incomplete branch or commit metadata) and present robustness results both including and excluding these cases. The separation between Collaborator tools (>=96% agent-initiated) and Assistant tools is preserved in both analyses, supporting the reported spectrum while transparently documenting the boundary conditions. revision: yes

standing simulated objections not resolved
  • Full inter-rater validation against complete PR discussion threads across the entire dataset, due to the log-based nature of the data sources which do not provide uniform access to conversational content for all 29,585 PRs.

Circularity Check

0 steps flagged

No circularity: direct empirical classification of observed PR events

full rationale

The paper defines an Initiator x Approver taxonomy with six scenarios as a contribution, then applies it to reconstruct 29,585 PR lifecycles from commit authorship, branch creation, and merge logs. Reported distributions (e.g., Collaborator tools >=96% agent-initiated with human merge authority) are direct outputs of this classification on the data; no equations, fitted parameters, predictions, or self-citations reduce any result to prior definitions by construction. The taxonomy and state machines are presented as new, and the analysis remains self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that PR logs allow faithful reconstruction of initiator and approver roles and that the five tools can be cleanly placed on the collaborator-assistant spectrum without hybrid cases dominating the data.

axioms (1)
  • domain assumption: PR lifecycle events in the studied platforms can be reliably mapped to initiator and approver roles from available metadata and logs.
    Required to classify 29,585 PRs into the six interaction scenarios.
invented entities (1)
  • Collaborator-Assistant spectrum (no independent evidence)
    purpose: Framework to classify how AI tools redistribute operational initiative versus human oversight.
    New classification axis introduced to organize observations across the five tools.

pith-pipeline@v0.9.0 · 5543 in / 1305 out tokens · 38901 ms · 2026-05-14T21:17:36.129937+00:00 · methodology

discussion (0)

