pith. machine review for the scientific record.

arxiv: 2604.03196 · v1 · submitted 2026-04-03 · 💻 cs.SE

Recognition: no theorem link

From Industry Claims to Empirical Reality: An Empirical Study of Code Review Agents in Pull Requests


Pith reviewed 2026-05-13 18:41 UTC · model grok-4.3

classification 💻 cs.SE
keywords: code review agents · pull requests · merge rates · signal-to-noise ratio · automated code review · PR abandonment · open source workflows

The pith

Code review agents alone merge 45% of pull requests, 23 points below human-only reviews.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests industry claims that code review agents can independently handle most pull requests by comparing real outcomes across thousands of cases. It examines PRs reviewed only by agents versus only by humans, measuring merge success and abandonment rates. CRA-only PRs show a 45% merge rate against 68% for human reviews, with far more abandonment. The study ties these poorer results to low signal quality in agent comments: 60.2% of closed CRA-only PRs fall into the lowest signal band (0-30%). The findings indicate that agent feedback often lacks the clarity needed to move PRs forward without human input.

Core claim

CRA-only PRs achieve a 45.20% merge rate, 23.17 percentage points lower than human-only PRs (68.37%), with significantly higher abandonment. 60.2% of closed CRA-only PRs fall into the 0-30% signal range, and 12 of 13 CRAs exhibit average signal ratios below 60%, indicating substantial noise in automated review feedback.

What carries the argument

The signal-to-noise ratio of CRA-generated comments, classified to quantify review quality and linked to PR merge success versus abandonment.
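
A minimal sketch, assuming a per-comment actionability label, of how such a per-PR signal ratio could be computed. The definition follows the simulated rebuttal below (signal ratio = actionable comments / total comments); the keyword heuristic and example comments are illustrative assumptions, not the paper's classifier.

    # Sketch: per-PR signal ratio from review comments. The keyword
    # heuristic below is a toy stand-in for the paper's classification.
    ACTIONABLE_HINTS = ("should", "consider", "bug", "fix", "rename", "missing")

    def classify_comment(text: str) -> str:
        """Toy heuristic: flag comments that suggest a concrete change."""
        lowered = text.lower()
        return "signal" if any(h in lowered for h in ACTIONABLE_HINTS) else "noise"

    def signal_ratio(comments: list[str]) -> float:
        """Signal ratio = actionable comments / total comments."""
        if not comments:
            return 0.0
        labels = [classify_comment(c) for c in comments]
        return labels.count("signal") / len(labels)

    # Example: a review landing in the 0-30% band that 60.2% of closed
    # CRA-only PRs fall into.
    pr_comments = [
        "Nice work!",                            # noise
        "LGTM overall.",                         # noise
        "Consider renaming `tmp` for clarity.",  # signal
        "Thanks for the contribution.",          # noise
    ]
    print(f"signal ratio: {signal_ratio(pr_comments):.0%}")  # -> 25%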

If this is right

  • CRA-only reviews produce merge rates 23 percentage points lower than human-only reviews.
  • Over 60% of closed CRA-only PRs show signal ratios below 30%.
  • Abandonment rises sharply when reviews rely solely on CRAs.
  • CRAs should augment human reviewers rather than replace them to support successful PR outcomes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developer effort on agent-generated PRs may often be wasted when comments stay mostly noise.
  • CRAs could improve by targeting training data from high-signal human reviews.
  • Hybrid workflows that add human oversight after initial agent comments might raise overall merge rates.

Load-bearing premise

PRs in the CRA-only and human-only groups are comparable in complexity, project context, and other factors that could otherwise explain differences in outcomes.

What would settle it

A study that matches CRA-only and human-only PRs on size, complexity, and project type, and then finds equivalent merge rates.
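
A minimal sketch of that matched comparison in Python with pandas: stratify PRs on coarse covariates, then compare merge rates within strata. The column names and toy rows are hypothetical; the AIDev schema may differ.

    # Sketch: within-stratum merge-rate comparison. Strata are coarse
    # bins of PR size and project type; the rows below are toy data.
    import pandas as pd

    prs = pd.DataFrame({
        "reviewer":  ["cra", "human", "cra", "human", "cra", "human"],
        "size_bin":  ["S", "S", "S", "S", "L", "L"],   # e.g. lines changed, binned
        "proj_type": ["lib", "lib", "app", "app", "lib", "lib"],
        "merged":    [0, 1, 1, 1, 0, 1],
    })

    by_stratum = (
        prs.groupby(["size_bin", "proj_type", "reviewer"])["merged"]
           .mean()
           .unstack("reviewer")
    )
    by_stratum["gap_pp"] = 100 * (by_stratum["human"] - by_stratum["cra"])
    print(by_stratum)

Equivalent within-stratum rates would undercut the paper's attribution; a gap that survives matching would support it.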

Figures

Figures reproduced from arXiv: 2604.03196 by Dipayan Banik, K M Ferdous, Kowshik Chowdhury, Shazibul Islam Shamim.

Figure 1: Signal-to-noise ratio distribution across closed CRA-only PRs. [image not reproduced here]
Original abstract

Autonomous coding agents are generating code at an unprecedented scale, with OpenAI Codex alone creating over 400,000 pull requests (PRs) in two months. As agentic PR volumes increase, code review agents (CRAs) have become routine gatekeepers in development workflows. Industry reports claim that CRAs can manage 80% of PRs in open source repositories without human involvement. As a result, understanding the effectiveness of CRA reviews is crucial for maintaining developmental workflows and preventing wasted effort on abandoned pull requests. However, empirical evidence on how CRA feedback quality affects PR outcomes remains limited. The goal of this paper is to help researchers and practitioners understand when and how CRAs influence PR merge success by empirically analyzing reviewer composition and the signal quality of CRA-generated comments. From AIDev's 19,450 PRs, we analyze 3,109 unique PRs in the commented review state, comparing human-only versus CRA-only reviews. We examine 98 closed CRA-only PRs to assess whether low signal-to-noise ratios contribute to abandonment. CRA-only PRs achieve a 45.20% merge rate, 23.17 percentage points lower than human-only PRs (68.37%), with significantly higher abandonment. Our signal-to-noise analysis reveals that 60.2% of closed CRA-only PRs fall into the 0-30% signal range, and 12 of 13 CRAs exhibit average signal ratios below 60%, indicating substantial noise in automated review feedback. These findings suggest that CRAs without human oversight often generate low-signal feedback associated with higher abandonment. For practitioners, our results indicate that CRAs should augment rather than replace human reviewers and that human involvement remains critical for effective and actionable code review.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper empirically analyzes code review agents (CRAs) using 3,109 PRs in the commented review state drawn from AIDev's 19,450 PRs. It reports that CRA-only PRs achieve a 45.20% merge rate (23.17 pp lower than the 68.37% for human-only PRs) with higher abandonment, and that 60.2% of 98 closed CRA-only PRs fall in the 0-30% signal-to-noise range, with 12 of 13 CRAs showing average signal ratios below 60%. The central claim is that low-signal CRA feedback drives poorer outcomes and that CRAs should augment rather than replace human reviewers.

Significance. If the attribution of the merge-rate gap to review quality holds after addressing confounders, the study offers timely, large-scale evidence on the practical limitations of autonomous code review agents in open-source workflows. The scale of the PR dataset and the focus on real abandonment outcomes could inform both tool design and developer practices regarding human-AI collaboration in code review.

major comments (3)
  1. [Dataset and Sample Selection] The analysis restricts attention to 3,109 PRs in the 'commented review state' without reporting exclusion rules, balance checks, or comparisons of PR size, file count, change type, repository, or author experience between CRA-only and human-only groups. This selection step is load-bearing for the 23.17 pp merge-rate claim because unadjusted differences in complexity or project context could mechanically produce the observed gap.
  2. [Comparative Analysis] No matching, stratification, or regression controls are applied to observable confounders when comparing CRA-only versus human-only PRs. Without such adjustments, the central attribution of higher abandonment and lower merge rates (45.20% vs. 68.37%) to CRA feedback quality remains vulnerable to omitted-variable bias.
  3. [Signal-to-Noise Analysis] The signal-to-noise classification applied to the 98 closed CRA-only PRs lacks a precise definition of the ratio calculation, inter-rater reliability statistics, or external validation against merge outcomes. The reported 60.2% figure in the 0-30% range is therefore difficult to interpret as evidence of review quality rather than measurement artifact.
minor comments (1)
  1. [Results] The abstract and results sections should report the exact statistical tests, p-values, and confidence intervals supporting the claim of 'significantly higher abandonment' for CRA-only PRs.
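
For illustration, a minimal sketch of the kind of test this comment requests: a two-proportion z-test on the merge-rate gap (via statsmodels) with a Wald confidence interval for the difference. The group sizes are hypothetical placeholders, since the CRA-only/human-only split of the 3,109 commented PRs is not reported in this excerpt.

    # Sketch: two-proportion z-test for the 45.20% vs 68.37% merge rates.
    # Group sizes are invented placeholders, not the paper's counts.
    import numpy as np
    from statsmodels.stats.proportion import proportions_ztest

    n_cra, n_human = 500, 2609                # hypothetical split of 3,109 PRs
    merged_cra = round(0.4520 * n_cra)        # 45.20% merge rate
    merged_human = round(0.6837 * n_human)    # 68.37% merge rate

    z, p = proportions_ztest([merged_cra, merged_human], [n_cra, n_human])

    # Wald 95% CI for the difference in proportions.
    p1, p2 = merged_cra / n_cra, merged_human / n_human
    se = np.sqrt(p1 * (1 - p1) / n_cra + p2 * (1 - p2) / n_human)
    lo, hi = (p1 - p2) - 1.96 * se, (p1 - p2) + 1.96 * se
    print(f"z = {z:.2f}, p = {p:.2e}, diff 95% CI = [{lo:.3f}, {hi:.3f}]")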

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important methodological clarifications that strengthen the paper. We address each point below and have revised the manuscript to incorporate additional details on selection, controls, and measurement.

Point-by-point responses
  1. Referee: [Dataset and Sample Selection] The analysis restricts attention to 3,109 PRs in the 'commented review state' without reporting exclusion rules, balance checks, or comparisons of PR size, file count, change type, repository, or author experience between CRA-only and human-only groups. This selection step is load-bearing for the 23.17 pp merge-rate claim because unadjusted differences in complexity or project context could mechanically produce the observed gap.

    Authors: We agree that explicit documentation of selection is necessary. The 3,109 PRs comprise every PR from the 19,450-PR AIDev corpus that reached the commented review state (i.e., received at least one review comment). PRs closed without comments were excluded because they involve no review activity. In the revision we add a dedicated subsection describing these rules and a balance table comparing CRA-only versus human-only groups on lines changed, file count, change type, repository, and author prior-PR count. The groups are broadly comparable, with CRA-only PRs modestly smaller on average; we discuss this difference explicitly. revision: yes

  2. Referee: [Comparative Analysis] No matching, stratification, or regression controls are applied to observable confounders when comparing CRA-only versus human-only PRs. Without such adjustments, the central attribution of higher abandonment and lower merge rates (45.20% vs. 68.37%) to CRA feedback quality remains vulnerable to omitted-variable bias.

    Authors: We accept that raw comparisons leave room for omitted-variable bias. The revised manuscript now includes a logistic regression of merge outcome on reviewer type with controls for PR size, file count, lines changed, repository fixed effects, and author experience. The CRA-only coefficient remains negative and statistically significant after adjustment. We also report propensity-score-matched results that preserve a comparable gap. These additions directly address the concern while leaving the substantive conclusion intact. revision: yes

  3. Referee: [Signal-to-Noise Analysis] The signal-to-noise classification applied to the 98 closed CRA-only PRs lacks a precise definition of the ratio calculation, inter-rater reliability statistics, or external validation against merge outcomes. The reported 60.2% figure in the 0-30% range is therefore difficult to interpret as evidence of review quality rather than measurement artifact.

    Authors: We have expanded the methods section with an exact definition: signal ratio = (actionable comments / total comments), where actionable comments are those that either prompted a code change in a later commit or were acknowledged by the author as useful. Two authors independently coded a 20% random subsample of comments, obtaining Cohen’s κ = 0.82. We further validate the measure by showing that PRs above the 60% signal threshold exhibit a 15 pp higher merge rate than those in the 0–30% bin. These clarifications are now reported in full. revision: yes
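
A minimal sketch of the inter-rater check described above, using scikit-learn's cohen_kappa_score on toy labels; the paper's 20% subsample and actual annotations are not reproduced here.

    # Sketch: Cohen's kappa between two coders labeling the same comments
    # as signal or noise. Labels are toy data, not the paper's annotations.
    from sklearn.metrics import cohen_kappa_score

    coder_a = ["signal", "noise", "signal", "signal", "noise", "noise", "signal", "noise"]
    coder_b = ["signal", "noise", "signal", "noise",  "noise", "noise", "signal", "noise"]

    kappa = cohen_kappa_score(coder_a, coder_b)
    print(f"Cohen's kappa = {kappa:.2f}")  # the rebuttal reports 0.82 on its subsample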

Circularity Check

0 steps flagged

No circularity: purely empirical observational analysis

Full rationale

The paper performs a direct empirical comparison of merge rates, abandonment, and signal-to-noise ratios between CRA-only and human-only PR groups drawn from the AIDev dataset of 19,450 PRs. No derivations, equations, fitted parameters, or first-principles predictions are present; the central statistics (45.20% vs 68.37% merge rates, 60.2% low-signal closed PRs) are computed directly from observed data without reduction to definitions or self-citations. The analysis is self-contained: it reports raw counts, percentages, and group comparisons without invoking uniqueness theorems, ansatzes, or prior author results as load-bearing premises. Self-citations, if any, are incidental and do not substitute for the new data analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard statistical comparison of observed proportions and a custom signal-to-noise metric; no free parameters are fitted to produce the headline numbers, and no new entities are postulated.

axioms (1)
  • standard math: Standard assumptions for comparing binary outcomes (merge/abandon) and proportions across groups
    Invoked when reporting 45.20% vs 68.37% merge rates and the 60.2% low-signal fraction.

pith-pipeline@v0.9.0 · 5633 in / 1258 out tokens · 50811 ms · 2026-05-13T18:41:53.933442+00:00 · methodology

