pith. machine review for the scientific record.

arxiv: 2605.02273 · v1 · submitted 2026-05-04 · 💻 cs.SE


These Aren't the Reviews You're Looking For: How Humans Review AI-Generated Pull Requests

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:14 UTC · model grok-4.3

classification 💻 cs.SE
keywords AI-generated code · pull requests · code review · GitHub · agentic workflows · software repositories · human oversight

The pith

AI-generated pull requests mostly receive AI reviews or none at all, unlike human-authored code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares review patterns for AI-generated and human-written pull requests in the same GitHub repositories using the AIDev dataset. It shows that AI PRs are far more likely to go unreviewed or to be handled primarily by other AI agents, while human PRs draw more direct human feedback and human-only reviews. Humans involved with AI PRs often steer agents rather than evaluate code themselves. A sympathetic reader would care because many mining studies treat review counts and comments as signs of human quality control, yet this difference suggests those signals work differently in agentic settings.

Core claim

In the AIDev dataset, most AI-generated PRs receive no review; when they are reviewed, AI agents rather than humans dominate the activity. Human-authored PRs are more likely to attract human-only review and direct human feedback. Reviews of AI PRs instead take the form of automation-mediated interaction, with human involvement often limited to agent steering rather than standalone evaluation.

What carries the argument

The structured comparison of review activity types (no review, human-only, AI-dominated, mixed) between AI-generated and human-authored PRs within the same repositories.
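The paper does not publish its classification code here, so as a reading aid, the sketch below shows one way such a bucketing could work; the event schema and the majority rule for "AI-dominated" are assumptions, not the authors' definitions.

```python
# Illustrative sketch only: the AIDev schema and the paper's exact
# bucketing rules are not given here, so these are assumptions.
from dataclasses import dataclass

@dataclass
class ReviewEvent:
    reviewer: str
    is_bot: bool  # assumed flag: True if the reviewer is an AI agent

def classify_review_activity(events: list[ReviewEvent]) -> str:
    """Bucket one PR into the four review-activity types named above."""
    if not events:
        return "no review"
    ai = sum(e.is_bot for e in events)
    human = len(events) - ai
    if ai == 0:
        return "human-only"
    if human == 0 or ai > human:  # assumed majority rule for "dominated"
        return "AI-dominated"
    return "mixed"

# Example: one human comment amid two agent reviews -> "AI-dominated".
print(classify_review_activity([
    ReviewEvent("copilot[bot]", True),
    ReviewEvent("alice", False),
    ReviewEvent("codex-agent[bot]", True),
]))
```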

If this is right

  • Review metrics from large-scale repository mining may overestimate the amount of human oversight applied to AI-generated code.
  • Agentic workflows produce review structures that differ systematically from traditional human developer processes.
  • Human oversight for AI PRs frequently occurs indirectly through steering rather than through direct code evaluation.
  • Studies that use review volume or comment counts as proxies for quality assurance need to account for AI participation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • As AI agents proliferate, traditional review-based metrics may become unreliable indicators of human involvement across more projects.
  • Project maintainers might need new signals or tooling to confirm that actual human judgment is applied to AI contributions.
  • This pattern could affect how open-source communities assign responsibility and trust when AI code enters the review pipeline.

Load-bearing premise

The AIDev dataset accurately labels which PRs were generated by AI and which reviews came from humans versus AI agents without major misclassification.

What would settle it

A manual inspection of several hundred PRs labeled AI-generated in the dataset that finds a high rate of human authorship or mislabeled reviewer types.
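A minimal sketch of how such an audit sample could be drawn, assuming a tabular export with hypothetical column names (`repo`, `pr_id`, `author_type`, `url`); nothing here reflects the actual AIDev schema:

```python
# Hypothetical audit sketch: the file name and columns ("repo", "pr_id",
# "author_type", "url") are stand-ins, not the real AIDev schema.
import pandas as pd

df = pd.read_parquet("aidev_prs.parquet")  # hypothetical export
ai_prs = df[df["author_type"] == "ai"]

# Stratify by repository so a few large repos don't dominate the sample.
sample = (
    ai_prs.groupby("repo", group_keys=False)
          .apply(lambda g: g.sample(n=min(len(g), 5), random_state=0))
)
sample[["repo", "pr_id", "url"]].to_csv("audit_sample.csv", index=False)
# Manually label each sampled PR; the observed mislabel rate bounds how
# much of the reported review-structure gap could be artifactual.
```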

Figures

Figures reproduced from arXiv: 2605.02273 by Jagoda Bobińska, Julia Winiarska, Kacper Duma, Patryk Wróblewski, and Piotr Przymus (Nicolaus Copernicus University in Toruń, Poland).

Figure 1: Overview of the AIDev-based data collection.
Original abstract

We analyze code review interactions for AI-generated pull requests (PRs) on GitHub using the AIDev dataset and compare them to human-authored PRs within the same repositories. We find that most AI-generated PRs receive no review and, when reviewed, are largely dominated by AI agents rather than humans. Human-authored PRs are more likely to receive human-only review and to attract direct human feedback. In contrast, reviews of AI-generated PRs more often take the form of automation-mediated interaction, with human involvement frequently expressed through agent steering rather than standalone evaluation. These results indicate systematic differences in how review activity is structured in agentic workflows and raise challenges for interpreting review metrics as indicators of human oversight in large-scale mining studies.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper analyzes code review interactions for AI-generated pull requests (PRs) on GitHub using the AIDev dataset and compares them to human-authored PRs in the same repositories. It reports that most AI-generated PRs receive no review and, when reviewed, are largely dominated by AI agents, while human-authored PRs are more likely to receive human-only review and direct human feedback. Reviews of AI PRs often involve automation-mediated interactions with human involvement via agent steering. The authors conclude that these patterns indicate systematic differences in agentic workflows and challenge the interpretation of review metrics as indicators of human oversight in large-scale mining studies.

Significance. If the dataset labels are reliable, the results provide valuable empirical insight into how review activity differs in AI-assisted development, with potential implications for software engineering research that relies on GitHub mining to study human oversight and collaboration. The work could help refine metrics used in empirical studies of open-source contributions.

major comments (1)
  1. [Data and Methods] The central claims rest on the accuracy of the AIDev dataset in identifying AI-generated PRs and classifying reviews as human versus AI. The manuscript provides no validation procedure, error rates, inter-rater agreement, or sensitivity analysis for these labels (see Data and Methods sections). Any non-negligible misclassification correlated with repository or PR characteristics would produce spurious differences in the reported review structures, undermining the challenge to mining-study metrics.
minor comments (2)
  1. [Abstract] The abstract is high-level and lacks any mention of statistical controls, sample sizes, or effect sizes; consider adding a brief quantitative summary to strengthen the high-level findings statement.
  2. [Title] The title is missing punctuation for readability; consider 'These Aren't the Reviews You're Looking For: How Humans Review AI-Generated Pull Requests'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding dataset validation below and will revise the paper to incorporate additional details and analyses.

Point-by-point responses
  1. Referee: The central claims rest on the accuracy of the AIDev dataset in identifying AI-generated PRs and classifying reviews as human versus AI. The manuscript provides no validation procedure, error rates, inter-rater agreement, or sensitivity analysis for these labels (see Data and Methods sections). Any non-negligible misclassification correlated with repository or PR characteristics would produce spurious differences in the reported review structures, undermining the challenge to mining-study metrics.

    Authors: We agree that the reliability of the AIDev dataset labels is foundational to our claims and that explicit discussion of validation strengthens the work. The dataset was introduced and documented in prior work, which reports on its construction, labeling process, and some validation steps including inter-rater agreement. To address this concern directly, we will revise the Data and Methods section to summarize those validation procedures, include available error rates and agreement metrics, and add a sensitivity analysis examining the robustness of our reported differences under varying levels of potential misclassification. We will also discuss the possibility of correlations between misclassification and repository or PR characteristics as an explicit limitation.

    revision: yes
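For intuition about what the promised sensitivity analysis could look like, here is a minimal sketch, assuming uncorrelated label noise and invented base rates (none of the numbers come from the paper): it flips a fraction p of author labels and recomputes the headline unreviewed-rate gap.

```python
# Minimal label-noise sensitivity sketch. All base rates below are
# invented for illustration; only the procedure is the point.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
is_ai = rng.random(n) < 0.5                        # true author label
unreviewed = np.where(is_ai, rng.random(n) < 0.8,  # assumed: 80% of AI PRs
                             rng.random(n) < 0.3)  # vs 30% of human PRs

for p in (0.0, 0.05, 0.10, 0.20):                  # misclassification rate
    observed_ai = np.where(rng.random(n) < p, ~is_ai, is_ai)
    gap = unreviewed[observed_ai].mean() - unreviewed[~observed_ai].mean()
    print(f"p={p:.2f}  observed unreviewed-rate gap (AI - human): {gap:+.3f}")
# Uncorrelated noise attenuates the gap toward zero rather than creating
# or reversing it; noise correlated with repository or PR characteristics
# is the scenario that needs real validation data.
```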

Circularity Check

0 steps flagged

No circularity: direct empirical comparison on external dataset

full rationale

The paper performs a straightforward empirical analysis by applying descriptive statistics and comparisons to review patterns in the AIDev dataset for AI-generated versus human PRs. No equations, fitted parameters, predictions, or derivations are present. Claims about systematic differences in review structures follow directly from observed data frequencies without any self-definitional reduction, renaming of known results, or load-bearing self-citations. The analysis is self-contained against the external dataset and does not reduce its central findings to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's conclusions depend on the assumption that the AIDev dataset provides reliable labels for AI-generated content and review participants.

axioms (1)
  • domain assumption The AIDev dataset accurately labels AI-generated PRs and review interactions as human or AI.
    The analysis relies on this dataset for all comparisons and findings.

pith-pipeline@v0.9.0 · 5470 in / 1166 out tokens · 27560 ms · 2026-05-08T18:14:34.871801+00:00 · methodology


Reference graph

Works this paper leans on

18 extracted references · 1 canonical work page · 1 internal anchor

  1. Fannar Steinn Aðalsteinsson, Björn Borgar Magnússon, Mislav Milicevic, Adam Nirving Davidsson, and Chih-Hong Cheng. Rethinking Code Review Workflows with LLM Assistance: An Empirical Study, May 2025.

  2. Adam Alami and Neil A. Ernst. Human and machine: How software engineers perceive and engage with AI-assisted code reviews compared to their peers, 2025.

  3. Alberto Bacchelli and Christian Bird. Expectations, outcomes, and challenges of modern code review. In 2013 35th International Conference on Software Engineering (ICSE), pages 712–721, San Francisco, CA, USA, May 2013. IEEE.

  4. Deepika Badampudi, Michael Unterkalmsteiner, and Ricardo Britto. Modern Code Reviews—Survey of Literature and Practice. ACM Trans. Softw. Eng. Methodol., 32(4):107:1–107:61, May 2023.

  5. H. Alperen Çetin, Emre Doğan, and Eray Tüzün. A review of code reviewer recommendation studies: Challenges and future directions. Science of Computer Programming, 208:102652, August 2021.

  6. Umut Cihan, Vahid Haratian, Arda İçöz, Mert Kaan Gül, Ömercan Devran, Emircan Furkan Bayendur, Baykal Mehmet Uçar, and Eray Tüzün. Automated code review in practice, 2024.

  7. Nicole Davila and Ingrid Nunes. A systematic literature review and taxonomy of modern code review. Journal of Systems and Software, 177:110951, July 2021.

  8. Jingzhi Gong, Giovanni Pinna, Yixin Bian, and Jie M. Zhang. Analyzing message-code inconsistency in AI coding agent-authored pull requests, 2026.

  9. Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering. arXiv preprint arXiv:2507.15003, 2025.

  10. Peter C. Rigby and Christian Bird. Convergent contemporary software peer review practices. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, pages 202–212, New York, NY, USA, August 2013. Association for Computing Machinery.

  11. Ariel Rokem. Ten simple rules for scientific code review. PLOS Computational Biology, 20(9):e1012375, September 2024.

  12. Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli. Modern code review: a case study at Google. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP '18, pages 181–190, New York, NY, USA, 2018. Association for Computing Machinery.

  13. Shirin Pirouzkhah, Pavlína Wurzel Gonçalves, and Alberto Bacchelli. The value of effective pull request description, 2026.

  14. Kexin Sun, Hongyu Kuang, Sebastian Baltes, Xin Zhou, He Zhang, Xiaoxing Ma, Guoping Rong, Dong Shao, and Christoph Treude. Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions, August 2025.

  15. Xunzhu Tang, Kisub Kim, Yewei Song, Cedric Lothritz, Bei Li, Saad Ezzini, Haoye Tian, Jacques Klein, and Tegawende F. Bissyande. CodeAgent: Autonomous Communicative Agents for Code Review, September 2024.

  16. Rosalia Tufano, Ozren Dabić, Antonio Mastropaolo, Matteo Ciniselli, and Gabriele Bavota. Code review automation: Strengths and weaknesses of the state of the art, 2024.

  17. Asif Kamal Turzo, Fahim Faysal, Ovi Poddar, Jaydeb Sarker, Anindya Iqbal, and Amiangshu Bosu. Towards automated classification of code review feedback to support analytics, 2023.

  18. Miku Watanabe, Yutaro Kashiwa, Bin Lin, Toshiki Hirao, Ken'ichi Yamaguchi, and Hajimu Iida. On the Use of ChatGPT for Code Review: Do Developers Like Reviews By ChatGPT? June 2024.