Recognition: no theorem link
From Industry Claims to Empirical Reality: An Empirical Study of Code Review Agents in Pull Requests
Pith reviewed 2026-05-13 18:41 UTC · model grok-4.3
The pith
Pull requests reviewed only by code review agents merge at a 45% rate, 23 percentage points below human-only reviews.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CRA-only PRs achieve a 45.20% merge rate, 23.17 percentage points lower than human-only PRs (68.37%), with significantly higher abandonment. 60.2% of closed CRA-only PRs fall into the 0-30% signal range, and 12 of 13 CRAs exhibit average signal ratios below 60%, indicating substantial noise in automated review feedback.
What carries the argument
The signal-to-noise ratio of CRA-generated comments, classified to quantify review quality and linked to PR merge success versus abandonment.
If this is right
- CRA-only reviews produce merge rates 23 percentage points lower than human-only reviews.
- Over 60% of closed CRA-only PRs show signal ratios below 30%.
- Abandonment rises sharply when reviews rely solely on CRAs.
- CRAs should augment human reviewers rather than replace them to support successful PR outcomes.
Where Pith is reading between the lines
- Developer effort on agent-generated PRs may often be wasted when the review comments are mostly noise.
- CRAs could improve by targeting training data from high-signal human reviews.
- Hybrid workflows that add human oversight after initial agent comments might raise overall merge rates.
Load-bearing premise
PRs in the CRA-only and human-only groups are comparable in complexity, project context, and other factors that could otherwise explain differences in outcomes.
What would settle it
A study that matches CRA-only and human-only PRs on size, complexity, and project type, and then finds equivalent merge rates.
original abstract
Autonomous coding agents are generating code at an unprecedented scale, with OpenAI Codex alone creating over 400,000 pull requests (PRs) in two months. As agentic PR volumes increase, code review agents (CRAs) have become routine gatekeepers in development workflows. Industry reports claim that CRAs can manage 80% of PRs in open source repositories without human involvement. As a result, understanding the effectiveness of CRA reviews is crucial for maintaining developmental workflows and preventing wasted effort on abandoned pull requests. However, empirical evidence on how CRA feedback quality affects PR outcomes remains limited. The goal of this paper is to help researchers and practitioners understand when and how CRAs influence PR merge success by empirically analyzing reviewer composition and the signal quality of CRA-generated comments. From AIDev's 19,450 PRs, we analyze 3,109 unique PRs in the commented review state, comparing human-only versus CRA-only reviews. We examine 98 closed CRA-only PRs to assess whether low signal-to-noise ratios contribute to abandonment. CRA-only PRs achieve a 45.20% merge rate, 23.17 percentage points lower than human-only PRs (68.37%), with significantly higher abandonment. Our signal-to-noise analysis reveals that 60.2% of closed CRA-only PRs fall into the 0-30% signal range, and 12 of 13 CRAs exhibit average signal ratios below 60%, indicating substantial noise in automated review feedback. These findings suggest that CRAs without human oversight often generate low-signal feedback associated with higher abandonment. For practitioners, our results indicate that CRAs should augment rather than replace human reviewers and that human involvement remains critical for effective and actionable code review.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically analyzes code review agents (CRAs) using 3,109 PRs in the commented review state drawn from AIDev's 19,450 PRs. It reports that CRA-only PRs achieve a 45.20% merge rate (23.17 pp lower than the 68.37% for human-only PRs) with higher abandonment, and that 60.2% of 98 closed CRA-only PRs fall in the 0-30% signal-to-noise range, with 12 of 13 CRAs showing average signal ratios below 60%. The central claim is that low-signal CRA feedback drives poorer outcomes and that CRAs should augment rather than replace human reviewers.
Significance. If the attribution of the merge-rate gap to review quality holds after addressing confounders, the study offers timely, large-scale evidence on the practical limitations of autonomous code review agents in open-source workflows. The scale of the PR dataset and the focus on real abandonment outcomes could inform both tool design and developer practices regarding human-AI collaboration in code review.
major comments (3)
- [Dataset and Sample Selection] The analysis restricts attention to 3,109 PRs in the 'commented review state' without reporting exclusion rules, balance checks, or comparisons of PR size, file count, change type, repository, or author experience between CRA-only and human-only groups. This selection step is load-bearing for the 23.17 pp merge-rate claim because unadjusted differences in complexity or project context could mechanically produce the observed gap.
- [Comparative Analysis] No matching, stratification, or regression controls are applied to observable confounders when comparing CRA-only versus human-only PRs. Without such adjustments, the central attribution of higher abandonment and lower merge rates (45.20% vs. 68.37%) to CRA feedback quality remains vulnerable to omitted-variable bias.
- [Signal-to-Noise Analysis] The signal-to-noise classification applied to the 98 closed CRA-only PRs lacks a precise definition of the ratio calculation, inter-rater reliability statistics, or external validation against merge outcomes. The reported 60.2% figure in the 0-30% range is therefore difficult to interpret as evidence of review quality rather than measurement artifact.
minor comments (1)
- [Results] The abstract and results sections should report the exact statistical tests, p-values, and confidence intervals supporting the claim of 'significantly higher abandonment' for CRA-only PRs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important methodological clarifications that strengthen the paper. We address each point below and have revised the manuscript to incorporate additional details on selection, controls, and measurement.
point-by-point responses
Referee: [Dataset and Sample Selection] The analysis restricts attention to 3,109 PRs in the 'commented review state' without reporting exclusion rules, balance checks, or comparisons of PR size, file count, change type, repository, or author experience between CRA-only and human-only groups. This selection step is load-bearing for the 23.17 pp merge-rate claim because unadjusted differences in complexity or project context could mechanically produce the observed gap.
Authors: We agree that explicit documentation of selection is necessary. The 3,109 PRs comprise every PR from the 19,450-PR AIDev corpus that reached the commented review state (i.e., received at least one review comment). PRs closed without comments were excluded because they involve no review activity. In the revision we add a dedicated subsection describing these rules and a balance table comparing CRA-only versus human-only groups on lines changed, file count, change type, repository, and author prior-PR count. The groups are broadly comparable, with CRA-only PRs modestly smaller on average; we discuss this difference explicitly. revision: yes
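For readers who want a concrete picture of this kind of balance check, a minimal sketch follows, assuming a pandas DataFrame of AIDev PRs. The column names (reviewer_type, lines_changed, file_count, author_prior_prs) are hypothetical placeholders, not the paper's actual schema.

```python
# Sketch of a covariate balance check between CRA-only and human-only PRs.
# Column names are hypothetical; the AIDev dataset's fields may differ.
import numpy as np
import pandas as pd


def standardized_mean_difference(a: pd.Series, b: pd.Series) -> float:
    """SMD = (mean_a - mean_b) / pooled standard deviation."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd


def balance_table(prs: pd.DataFrame, covariates: list[str]) -> pd.DataFrame:
    """One row per covariate: group means and the standardized mean difference."""
    cra = prs[prs["reviewer_type"] == "cra_only"]
    human = prs[prs["reviewer_type"] == "human_only"]
    rows = []
    for cov in covariates:
        rows.append({
            "covariate": cov,
            "cra_only_mean": cra[cov].mean(),
            "human_only_mean": human[cov].mean(),
            "smd": standardized_mean_difference(cra[cov], human[cov]),
        })
    return pd.DataFrame(rows)


# Usage (|SMD| < 0.1 is a common rule of thumb for "broadly comparable"):
# table = balance_table(prs, ["lines_changed", "file_count", "author_prior_prs"])
```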
Referee: [Comparative Analysis] No matching, stratification, or regression controls are applied to observable confounders when comparing CRA-only versus human-only PRs. Without such adjustments, the central attribution of higher abandonment and lower merge rates (45.20% vs. 68.37%) to CRA feedback quality remains vulnerable to omitted-variable bias.
Authors: We accept that raw comparisons leave room for omitted-variable bias. The revised manuscript now includes a logistic regression of merge outcome on reviewer type with controls for PR size, file count, lines changed, repository fixed effects, and author experience. The CRA-only coefficient remains negative and statistically significant after adjustment. We also report propensity-score-matched results that preserve a comparable gap. These additions directly address the concern while leaving the substantive conclusion intact. revision: yes
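A minimal sketch of the adjusted comparison described above, using statsmodels. The formula and column names (merged, cra_only, lines_changed, file_count, author_prior_prs, repo) are assumed placeholders, not the authors' exact specification.

```python
# Sketch of a logistic regression of merge outcome on reviewer type with
# controls and repository fixed effects, as described in the rebuttal.
# merged and cra_only are assumed to be 0/1 indicators; column names are hypothetical.
import statsmodels.formula.api as smf


def fit_merge_model(prs):
    """Fit merge ~ reviewer type + PR-size controls + repository fixed effects."""
    model = smf.logit(
        "merged ~ cra_only + lines_changed + file_count"
        " + author_prior_prs + C(repo)",
        data=prs,
    )
    return model.fit(disp=False)


# A persistently negative, significant coefficient on cra_only after these
# controls is the pattern the revised analysis reports.
# print(fit_merge_model(prs).summary())
```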
Referee: [Signal-to-Noise Analysis] The signal-to-noise classification applied to the 98 closed CRA-only PRs lacks a precise definition of the ratio calculation, inter-rater reliability statistics, or external validation against merge outcomes. The reported 60.2% figure in the 0-30% range is therefore difficult to interpret as evidence of review quality rather than measurement artifact.
Authors: We have expanded the methods section with an exact definition: signal ratio = (actionable comments / total comments), where actionable comments are those that either prompted a code change in a later commit or were acknowledged by the author as useful. Two authors independently coded a 20% random subsample of comments, obtaining Cohen’s κ = 0.82. We further validate the measure by showing that PRs above the 60% signal threshold exhibit a 15 pp higher merge rate than those in the 0–30% bin. These clarifications are now reported in full. revision: yes
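The stated definition lends itself to a short sketch. The label values and helper names below are hypothetical, and the paper's coding of "actionable" comments is a manual procedure rather than a script; only the arithmetic and the agreement statistic are shown here.

```python
# Sketch of the signal-ratio definition (actionable comments / total comments)
# and the inter-rater agreement check on a double-coded subsample.
from sklearn.metrics import cohen_kappa_score


def signal_ratio(comment_labels: list[str]) -> float:
    """comment_labels: one 'actionable' or 'noise' label per CRA comment on a PR."""
    if not comment_labels:
        return 0.0
    actionable = sum(1 for label in comment_labels if label == "actionable")
    return actionable / len(comment_labels)


def interrater_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa over parallel label lists (the paper reports kappa = 0.82)."""
    return cohen_kappa_score(rater_a, rater_b)


# Example: a PR where 1 of 4 CRA comments prompted a change falls in the 0-30% bin.
# signal_ratio(["actionable", "noise", "noise", "noise"])  # -> 0.25
```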
Circularity Check
No circularity: purely empirical observational analysis
full rationale
The paper performs a direct empirical comparison of merge rates, abandonment, and signal-to-noise ratios between CRA-only and human-only PR groups drawn from the AIDev dataset of 19,450 PRs. No derivations, equations, fitted parameters, or first-principles predictions are present; the central statistics (45.20% vs 68.37% merge rates, 60.2% low-signal closed PRs) are computed directly from observed data without reduction to definitions or self-citations. The analysis is self-contained against external benchmarks because it reports raw counts, percentages, and group comparisons without invoking uniqueness theorems, ansatzes, or prior author results as load-bearing premises. Self-citations, if any, are incidental and do not substitute for the new data analysis.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: standard assumptions for comparing binary outcomes (merge/abandon) and proportions across groups; a minimal test sketch follows below.
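As a concrete reading of that axiom, here is a minimal sketch of a two-proportion comparison applied to the reported merge rates. The group counts in the example call are illustrative assumptions, not the paper's actual group sizes, and the paper does not state which test it uses.

```python
# Sketch of a standard two-proportion z-test for a difference in merge rates.
from statsmodels.stats.proportion import proportions_ztest


def compare_merge_rates(merged_cra, n_cra, merged_human, n_human):
    """Two-sided z-test for the difference between two merge proportions."""
    stat, p_value = proportions_ztest(
        count=[merged_cra, merged_human], nobs=[n_cra, n_human]
    )
    return stat, p_value


# Illustrative call only (counts roughly matching 45.20% vs 68.37%,
# not the paper's actual group sizes):
# compare_merge_rates(merged_cra=226, n_cra=500, merged_human=1780, n_human=2603)
```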
Reference graph
Works this paper leans on
- [1] Everett Butler. 2025. What Developers Need to Know About AI Code Reviews. https://www.greptile.com/blog/ai-code-review. Accessed 2025-12-19.
- [2] Michael Castaldi. 2025. Perceptions and Challenges of AI-driven Code Reviews. In Issues in Information Systems (IIS), 346–360. https://iacis.org/iis/2025/2_iis_2025_346-360.pdf
- [3]
- [4] Worawalan Chatlatanagulchai, Hao Li, Yutaro Kashiwa, Brittany Reid, Kundjanasith Thonglek, Pattara Leelaprute, Arnon Rungsawang, Bundit Manaskasemsak, Bram Adams, Ahmed E. Hassan, and Hajimu Iida. 2025. Agent READMEs: An Empirical Study of Context Files for Agentic Coding. arXiv preprint arXiv:2511.12884. https://arxiv.org/abs/2511.12884
- [5] Kowshik Chowdhury. 2026. Analysis Code and Datasets for Agentic PR Reviewer Performance (MSR Challenge 2025). https://doi.org/10.6084/m9.figshare.30978193.v1. Accessed 2025-12-30.
- [6] Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K. Lahiri. 2024. LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation. IEEE Transactions on Software Engineering. doi:10.1109/TSE.2024.3428972
- [7] Shaoduo Gan, Yinxing Xue, He Jiang, and Ye Yang. 2025. Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions. arXiv preprint arXiv:2508.18771. https://arxiv.org/abs/2508.18771
- [8]
- [9]
- [10] Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. 2025. The Rise of AI Teammates in Software Engineering (SE 3.0): How Autonomous Coding Agents Are Reshaping Software Engineering. arXiv preprint arXiv:2507.15003. https://arxiv.org/pdf/2507.15003v1.pdf
- [11] Hong Yi Lin, Chunhua Liu, Haoyu Gao, Patanamon Thongtanunam, and Christoph Treude. 2025. CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, 9138–9166. Association for Computational Linguistics, Vienna, Austria. doi:10.18653/v1/2025.findings-acl.476
- [12]
- [13] Qodo. 2025. 2025 State of AI Code Quality. https://www.qodo.ai/reports/state-of-ai-code-quality/. Accessed 2025-12-19.
- [14] Shweta Ramesh, Joy Bose, Hamender Singh, A. K. Raghavan, Sujoy Roychowdhury, Giriprasad Sridhara, Nishrith Saini, and Ricardo Britto. 2025. Automated Code Review Using Large Language Models at Ericsson: An Experience Report. arXiv preprint arXiv:2507.19115. https://arxiv.org/abs/2507.19115
- [15] Johnny Saldaña. 2021. The Coding Manual for Qualitative Researchers.
- [16]
- [17]
- [18] Manushree Vijayvergiya, Małgorzata Salawa, Ivan Budiselić, Dan Zheng, Pascal Lamblin, Marko Ivanković, Juanjo Carin, Mateusz Lewko, Jovan Andonov, Goran Petrović, Daniel Tarlow, Petros Maniatis, and René Just. 2024. AI-Assisted Assessment of Coding Practices in Modern Code Review. arXiv preprint arXiv:2405.13565. https://arxiv.org/abs/2405.13565
- [19]
- [20] Mairieli Santos Wessel, Alexander Serebrenik, Igor Wiese, Igor Steinmacher, and Marco Aurelio Gerosa. 2020. What to Expect from Code Review Bots on GitHub? A Survey with OSS Maintainers. In Proceedings of the 34th Brazilian Symposium on Software Engineering (SBES). doi:10.1145/3422392.3422459
- [21] Jet Xu. 2025. Drowning in AI Code Review Noise? A Framework to Measure Signal vs. Noise. https://jetxu-llm.github.io/posts/low-noise-code-review/. Accessed 2025-12-20.
- [22]