Recognition: no theorem link
From Industry Claims to Empirical Reality: An Empirical Study of Code Review Agents in Pull Requests
Pith reviewed 2026-05-13 18:41 UTC · model grok-4.3
The pith
Pull requests reviewed only by code review agents merge at a 45% rate, 23 percentage points below human-only reviews.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CRA-only PRs achieve a 45.20% merge rate, 23.17 percentage points lower than human-only PRs (68.37%), with significantly higher abandonment. 60.2% of closed CRA-only PRs fall into the 0-30% signal range, and 12 of 13 CRAs exhibit average signal ratios below 60%, indicating substantial noise in automated review feedback.
What carries the argument
The signal-to-noise ratio of CRA-generated comments, classified to quantify review quality and linked to PR merge success versus abandonment.
If this is right
- CRA-only reviews produce merge rates 23 percentage points lower than human-only reviews.
- Over 60% of closed CRA-only PRs show signal ratios below 30%.
- Abandonment rises sharply when reviews rely solely on CRAs.
- CRAs should augment human reviewers rather than replace them to support successful PR outcomes.
Where Pith is reading between the lines
- Developer effort on agent-generated PRs may often be wasted when the review comments are mostly noise.
- CRAs could improve by targeting training data from high-signal human reviews.
- Hybrid workflows that add human oversight after initial agent comments might raise overall merge rates.
Load-bearing premise
PRs in the CRA-only and human-only groups are comparable in complexity, project context, and other factors that could otherwise explain differences in outcomes.
What would settle it
A study that matches CRA-only and human-only PRs on size, complexity, and project type, and then finds equivalent merge rates.
original abstract
Autonomous coding agents are generating code at an unprecedented scale, with OpenAI Codex alone creating over 400,000 pull requests (PRs) in two months. As agentic PR volumes increase, code review agents (CRAs) have become routine gatekeepers in development workflows. Industry reports claim that CRAs can manage 80% of PRs in open source repositories without human involvement. As a result, understanding the effectiveness of CRA reviews is crucial for maintaining developmental workflows and preventing wasted effort on abandoned pull requests. However, empirical evidence on how CRA feedback quality affects PR outcomes remains limited. The goal of this paper is to help researchers and practitioners understand when and how CRAs influence PR merge success by empirically analyzing reviewer composition and the signal quality of CRA-generated comments. From AIDev's 19,450 PRs, we analyze 3,109 unique PRs in the commented review state, comparing human-only versus CRA-only reviews. We examine 98 closed CRA-only PRs to assess whether low signal-to-noise ratios contribute to abandonment. CRA-only PRs achieve a 45.20% merge rate, 23.17 percentage points lower than human-only PRs (68.37%), with significantly higher abandonment. Our signal-to-noise analysis reveals that 60.2% of closed CRA-only PRs fall into the 0-30% signal range, and 12 of 13 CRAs exhibit average signal ratios below 60%, indicating substantial noise in automated review feedback. These findings suggest that CRAs without human oversight often generate low-signal feedback associated with higher abandonment. For practitioners, our results indicate that CRAs should augment rather than replace human reviewers and that human involvement remains critical for effective and actionable code review.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically analyzes code review agents (CRAs) using 3,109 PRs in the commented review state drawn from AIDev's 19,450 PRs. It reports that CRA-only PRs achieve a 45.20% merge rate (23.17 pp lower than the 68.37% for human-only PRs) with higher abandonment, and that 60.2% of 98 closed CRA-only PRs fall in the 0-30% signal-to-noise range, with 12 of 13 CRAs showing average signal ratios below 60%. The central claim is that low-signal CRA feedback drives poorer outcomes and that CRAs should augment rather than replace human reviewers.
Significance. If the attribution of the merge-rate gap to review quality holds after addressing confounders, the study offers timely, large-scale evidence on the practical limitations of autonomous code review agents in open-source workflows. The scale of the PR dataset and the focus on real abandonment outcomes could inform both tool design and developer practices regarding human-AI collaboration in code review.
major comments (3)
- [Dataset and Sample Selection] The analysis restricts attention to 3,109 PRs in the 'commented review state' without reporting exclusion rules, balance checks, or comparisons of PR size, file count, change type, repository, or author experience between CRA-only and human-only groups. This selection step is load-bearing for the 23.17 pp merge-rate claim because unadjusted differences in complexity or project context could mechanically produce the observed gap.
- [Comparative Analysis] No matching, stratification, or regression controls are applied to observable confounders when comparing CRA-only versus human-only PRs. Without such adjustments, the central attribution of higher abandonment and lower merge rates (45.20% vs. 68.37%) to CRA feedback quality remains vulnerable to omitted-variable bias.
- [Signal-to-Noise Analysis] The signal-to-noise classification applied to the 98 closed CRA-only PRs lacks a precise definition of the ratio calculation, inter-rater reliability statistics, or external validation against merge outcomes. The reported 60.2% figure in the 0-30% range is therefore difficult to interpret as evidence of review quality rather than measurement artifact.
minor comments (1)
- [Results] The abstract and results sections should report the exact statistical tests, p-values, and confidence intervals supporting the claim of 'significantly higher abandonment' for CRA-only PRs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important methodological clarifications that strengthen the paper. We address each point below and have revised the manuscript to incorporate additional details on selection, controls, and measurement.
point-by-point responses
Referee: [Dataset and Sample Selection] The analysis restricts attention to 3,109 PRs in the 'commented review state' without reporting exclusion rules, balance checks, or comparisons of PR size, file count, change type, repository, or author experience between CRA-only and human-only groups. This selection step is load-bearing for the 23.17 pp merge-rate claim because unadjusted differences in complexity or project context could mechanically produce the observed gap.
Authors: We agree that explicit documentation of selection is necessary. The 3,109 PRs comprise every PR from the 19,450-PR AIDev corpus that reached the commented review state (i.e., received at least one review comment). PRs closed without comments were excluded because they involve no review activity. In the revision we add a dedicated subsection describing these rules and a balance table comparing CRA-only versus human-only groups on lines changed, file count, change type, repository, and author prior-PR count. The groups are broadly comparable, with CRA-only PRs modestly smaller on average; we discuss this difference explicitly. revision: yes
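For readers who want a concrete picture of this kind of balance check, a minimal sketch follows, assuming a pandas DataFrame of AIDev PRs. The column names (reviewer_type, lines_changed, file_count, author_prior_prs) are hypothetical placeholders, not the paper's actual schema.

```python
# Sketch of a covariate balance check between CRA-only and human-only PRs.
# Column names are hypothetical; the AIDev dataset's fields may differ.
import numpy as np
import pandas as pd


def standardized_mean_difference(a: pd.Series, b: pd.Series) -> float:
    """SMD = (mean_a - mean_b) / pooled standard deviation."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd


def balance_table(prs: pd.DataFrame, covariates: list[str]) -> pd.DataFrame:
    """One row per covariate: group means and the standardized mean difference."""
    cra = prs[prs["reviewer_type"] == "cra_only"]
    human = prs[prs["reviewer_type"] == "human_only"]
    rows = []
    for cov in covariates:
        rows.append({
            "covariate": cov,
            "cra_only_mean": cra[cov].mean(),
            "human_only_mean": human[cov].mean(),
            "smd": standardized_mean_difference(cra[cov], human[cov]),
        })
    return pd.DataFrame(rows)


# Usage (|SMD| < 0.1 is a common rule of thumb for "broadly comparable"):
# table = balance_table(prs, ["lines_changed", "file_count", "author_prior_prs"])
```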
Referee: [Comparative Analysis] No matching, stratification, or regression controls are applied to observable confounders when comparing CRA-only versus human-only PRs. Without such adjustments, the central attribution of higher abandonment and lower merge rates (45.20% vs. 68.37%) to CRA feedback quality remains vulnerable to omitted-variable bias.
Authors: We accept that raw comparisons leave room for omitted-variable bias. The revised manuscript now includes a logistic regression of merge outcome on reviewer type with controls for PR size, file count, lines changed, repository fixed effects, and author experience. The CRA-only coefficient remains negative and statistically significant after adjustment. We also report propensity-score-matched results that preserve a comparable gap. These additions directly address the concern while leaving the substantive conclusion intact. revision: yes
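A minimal sketch of the adjusted comparison described above, using statsmodels. The formula and column names (merged, cra_only, lines_changed, file_count, author_prior_prs, repo) are assumed placeholders, not the authors' exact specification.

```python
# Sketch of a logistic regression of merge outcome on reviewer type with
# controls and repository fixed effects, as described in the rebuttal.
# merged and cra_only are assumed to be 0/1 indicators; column names are hypothetical.
import statsmodels.formula.api as smf


def fit_merge_model(prs):
    """Fit merge ~ reviewer type + PR-size controls + repository fixed effects."""
    model = smf.logit(
        "merged ~ cra_only + lines_changed + file_count"
        " + author_prior_prs + C(repo)",
        data=prs,
    )
    return model.fit(disp=False)


# A persistently negative, significant coefficient on cra_only after these
# controls is the pattern the revised analysis reports.
# print(fit_merge_model(prs).summary())
```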
Referee: [Signal-to-Noise Analysis] The signal-to-noise classification applied to the 98 closed CRA-only PRs lacks a precise definition of the ratio calculation, inter-rater reliability statistics, or external validation against merge outcomes. The reported 60.2% figure in the 0-30% range is therefore difficult to interpret as evidence of review quality rather than measurement artifact.
Authors: We have expanded the methods section with an exact definition: signal ratio = (actionable comments / total comments), where actionable comments are those that either prompted a code change in a later commit or were acknowledged by the author as useful. Two authors independently coded a 20% random subsample of comments, obtaining Cohen’s κ = 0.82. We further validate the measure by showing that PRs above the 60% signal threshold exhibit a 15 pp higher merge rate than those in the 0–30% bin. These clarifications are now reported in full. revision: yes
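The stated definition lends itself to a short sketch. The label values and helper names below are hypothetical, and the paper's coding of "actionable" comments is a manual procedure rather than a script; only the arithmetic and the agreement statistic are shown here.

```python
# Sketch of the signal-ratio definition (actionable comments / total comments)
# and the inter-rater agreement check on a double-coded subsample.
from sklearn.metrics import cohen_kappa_score


def signal_ratio(comment_labels: list[str]) -> float:
    """comment_labels: one 'actionable' or 'noise' label per CRA comment on a PR."""
    if not comment_labels:
        return 0.0
    actionable = sum(1 for label in comment_labels if label == "actionable")
    return actionable / len(comment_labels)


def interrater_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa over parallel label lists (the paper reports kappa = 0.82)."""
    return cohen_kappa_score(rater_a, rater_b)


# Example: a PR where 1 of 4 CRA comments prompted a change falls in the 0-30% bin.
# signal_ratio(["actionable", "noise", "noise", "noise"])  # -> 0.25
```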
Circularity Check
No circularity: purely empirical observational analysis
full rationale
The paper performs a direct empirical comparison of merge rates, abandonment, and signal-to-noise ratios between CRA-only and human-only PR groups drawn from the AIDev dataset of 19,450 PRs. No derivations, equations, fitted parameters, or first-principles predictions are present; the central statistics (45.20% vs 68.37% merge rates, 60.2% low-signal closed PRs) are computed directly from observed data without reduction to definitions or self-citations. The analysis is self-contained against external benchmarks because it reports raw counts, percentages, and group comparisons without invoking uniqueness theorems, ansatzes, or prior author results as load-bearing premises. Self-citations, if any, are incidental and do not substitute for the new data analysis.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: standard assumptions for comparing binary outcomes (merge/abandon) and proportions across groups; a minimal test sketch follows below.
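As a concrete reading of that axiom, here is a minimal sketch of a two-proportion comparison applied to the reported merge rates. The group counts in the example call are illustrative assumptions, not the paper's actual group sizes, and the paper does not state which test it uses.

```python
# Sketch of a standard two-proportion z-test for a difference in merge rates.
from statsmodels.stats.proportion import proportions_ztest


def compare_merge_rates(merged_cra, n_cra, merged_human, n_human):
    """Two-sided z-test for the difference between two merge proportions."""
    stat, p_value = proportions_ztest(
        count=[merged_cra, merged_human], nobs=[n_cra, n_human]
    )
    return stat, p_value


# Illustrative call only (counts roughly matching 45.20% vs 68.37%,
# not the paper's actual group sizes):
# compare_merge_rates(merged_cra=226, n_cra=500, merged_human=1780, n_human=2603)
```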
Reference graph
Works this paper leans on
- [1] Everett Butler. 2025. What Developers Need to Know About AI Code Reviews. https://www.greptile.com/blog/ai-code-review. Accessed 2025-12-19.
- [2] Michael Castaldi. 2025. Perceptions and Challenges of AI-driven Code Reviews. In Issues in Information Systems (IIS), 346–360. https://iacis.org/iis/2025/2_iis_2025_346-360.pdf
- [3]
- [4] Worawalan Chatlatanagulchai, Hao Li, Yutaro Kashiwa, Brittany Reid, Kundjanasith Thonglek, Pattara Leelaprute, Arnon Rungsawang, Bundit Manaskasemsak, Bram Adams, Ahmed E. Hassan, and Hajimu Iida. 2025. Agent READMEs: An Empirical Study of Context Files for Agentic Coding. arXiv preprint arXiv:2511.12884. https://arxiv.org/abs/2511.12884
- [5] Kowshik Chowdhury. 2026. Analysis Code and Datasets for Agentic PR Reviewer Performance (MSR Challenge 2025). https://doi.org/10.6084/m9.figshare.30978193.v1. Accessed 2025-12-30.
- [6] Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K. Lahiri. 2024. LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation. IEEE Transactions on Software Engineering. doi:10.1109/TSE.2024.3428972
- [7] Shaoduo Gan, Yinxing Xue, He Jiang, and Ye Yang. 2025. Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions. arXiv preprint arXiv:2508.18771. https://arxiv.org/abs/2508.18771
- [8]
- [9]
- [10] Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. 2025. The Rise of AI Teammates in Software Engineering (SE 3.0): How Autonomous Coding Agents Are Reshaping Software Engineering. arXiv preprint arXiv:2507.15003. https://arxiv.org/pdf/2507.15003v1.pdf
- [11] Hong Yi Lin, Chunhua Liu, Haoyu Gao, Patanamon Thongtanunam, and Christoph Treude. 2025. CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, 9138–9166. Association for Computational Linguistics, Vienna, Austria. doi:10.18653/v1/2025.findings-acl.476
- [12]
- [13] Qodo. 2025. 2025 State of AI Code Quality. https://www.qodo.ai/reports/state-of-ai-code-quality/. Accessed 2025-12-19.
- [14] Shweta Ramesh, Joy Bose, Hamender Singh, A. K. Raghavan, Sujoy Roychowdhury, Giriprasad Sridhara, Nishrith Saini, and Ricardo Britto. 2025. Automated Code Review Using Large Language Models at Ericsson: An Experience Report. arXiv preprint arXiv:2507.19115. https://arxiv.org/abs/2507.19115
- [15] Johnny Saldaña. 2021. The Coding Manual for Qualitative Researchers.
- [16]
- [17]
- [18] Manushree Vijayvergiya, Małgorzata Salawa, Ivan Budiselić, Dan Zheng, Pascal Lamblin, Marko Ivanković, Juanjo Carin, Mateusz Lewko, Jovan Andonov, Goran Petrović, Daniel Tarlow, Petros Maniatis, and René Just. 2024. AI-Assisted Assessment of Coding Practices in Modern Code Review. arXiv preprint arXiv:2405.13565. https://arxiv.org/abs/2405.13565
- [19]
- [20] Mairieli Santos Wessel, Alexander Serebrenik, Igor Wiese, Igor Steinmacher, and Marco Aurelio Gerosa. 2020. What to Expect from Code Review Bots on GitHub? A Survey with OSS Maintainers. In Proceedings of the 34th Brazilian Symposium on Software Engineering (SBES). doi:10.1145/3422392.3422459
- [21] Jet Xu. 2025. Drowning in AI Code Review Noise? A Framework to Measure Signal vs. Noise. https://jetxu-llm.github.io/posts/low-noise-code-review/. Accessed 2025-12-20.
- [22]