Code Review Agent Benchmark
Pith reviewed 2026-05-15 00:17 UTC · model grok-4.3
The pith
Existing code review agents solve only around 40% of the tasks in c-CRAB, a benchmark built from human pull-request reviews.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The c-CRAB dataset, built by generating tests from human reviews of pull requests, shows that current code review agents solve only around 40% of its tasks and often focus on different aspects than human reviewers. This indicates both a performance gap for future work and an opportunity for human-agent collaboration in code review.
What carries the argument
The c-CRAB benchmark, which turns human pull-request reviews into tests that score whether an agent-generated review catches the same issues.
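A minimal sketch of what this scoring machinery could look like, assuming a hypothetical test schema; `ReviewTest`, `covers`, and `solve_rate` are illustrative names, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class ReviewTest:
    """One test derived from a human review comment (hypothetical schema)."""
    issue_id: str        # e.g. "missing-input-validation"
    file_path: str       # location the human comment pointed at
    keywords: list[str]  # signals the agent review must mention

def covers(agent_review: str, test: ReviewTest) -> bool:
    """Pass if the agent review mentions both the flagged location and the issue."""
    text = agent_review.lower()
    return test.file_path in agent_review and all(
        kw.lower() in text for kw in test.keywords
    )

def solve_rate(agent_review: str, tests: list[ReviewTest]) -> float:
    """Fraction of human-derived tests that the agent review satisfies."""
    if not tests:
        return 0.0
    return sum(covers(agent_review, t) for t in tests) / len(tests)
```

Keyword matching is only a stand-in here; the paper's pipeline generates the tests themselves, and a real harness would need a more robust notion of semantic coverage.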
If this is right
- Future agents can target the remaining 60% of tasks to narrow the performance gap.
- Complementary focus areas between agents and humans enable hybrid review processes.
- The generated tests function as an independent quality gate for any agent review (a minimal sketch of such a gate follows this list).
- Combining review agents with code-generation and test-generation agents could create end-to-end automated quality pipelines.
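A gate built on those held-out tests (third point above) could be as simple as thresholding the solve rate. A minimal sketch, reusing `ReviewTest` and `solve_rate` from the harness above; the 0.8 threshold is an illustrative knob, not a value from the paper:

```python
def quality_gate(agent_review: str, tests: list[ReviewTest],
                 threshold: float = 0.8) -> bool:
    """Accept an agent-generated review only if it covers enough of the
    human-derived tests; otherwise flag the PR for human attention."""
    return solve_rate(agent_review, tests) >= threshold
```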
Where Pith is reading between the lines
- The benchmark could be expanded to more languages and project scales to test broader applicability.
- Hybrid human-agent systems might reduce reviewer workload while maintaining higher coverage than either alone.
- Standardized benchmarks like this one could accelerate development of reliable AI tools for large-scale software maintenance.
Load-bearing premise
Tests derived from human reviews of pull requests give a reliable and complete measure of review quality.
What would settle it
A new agent that passes substantially more than 40% of c-CRAB tasks, or evidence that differences between agent and human reviews produce conflicting rather than additive results on actual codebases.
Original abstract
Software engineering agents have shown significant promise in writing code. As AI agents permeate code writing and generate huge volumes of code automatically, the matter of code quality comes front and centre. As the automatically generated code gets integrated into huge code-bases, the issue of code review, and broadly quality assurance, becomes important. In this paper, we take a fresh look at the problem and curate a code review dataset for AI agents to work with. Our dataset, called c-CRAB (pronounced see-crab), can evaluate agents for code review tasks. Specifically, given a pull request (which could be coming from code generation agents or humans), if a code review agent produces a review, our evaluation framework can assess the reviewing capability of the code review agent. Our evaluation framework is used to evaluate the state of the art today: the open-source PR-agent, as well as commercial code review agents from Devin, Claude Code, and Codex. Our c-CRAB dataset is systematically constructed from human reviews: given a human review of a pull request instance, we generate corresponding tests to evaluate the reviews generated by code review agents. Such a benchmark construction gives us several insights. Firstly, the existing review agents taken together can solve only around 40% of the c-CRAB tasks, indicating the potential for future research to close this gap. Secondly, we observe that the agent reviews often consider different aspects from the human reviews, indicating the potential for human-agent collaboration in code review that could be deployed in future software teams. Last but not least, the agent-generated tests from our dataset act as a held-out test suite and hence a quality gate for agent-generated reviews. What this will mean for the future collaboration of code generation agents, test generation agents, and code review agents remains to be investigated.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces c-CRAB, a benchmark dataset for code review agents constructed systematically from human pull-request reviews. Tests are generated from human comments to evaluate agent-generated reviews, with evaluations of the open-source PR-agent and commercial agents from Devin, Claude Code, and Codex. Key results include an aggregate solve rate of around 40% and the observation that agent reviews often address different aspects than human reviews, suggesting potential for human-agent collaboration.
Significance. If the test-generation and evaluation framework is valid, this provides a timely empirical benchmark quantifying current limitations in AI code review agents amid rising volumes of auto-generated code. The 40% solve rate and complementarity finding offer concrete guidance for closing gaps and designing hybrid workflows, while the held-out test-suite framing supplies a practical quality gate for integrated agent systems.
major comments (1)
- [Dataset Construction] The translation from human review comments to evaluation tests is central to the benchmark's reliability (see dataset construction description). The manuscript should specify the exact mapping rules or provide concrete examples of test generation to demonstrate that the tests faithfully capture review quality without introducing artifacts or coverage gaps.
minor comments (2)
- Add a table with per-agent solve rates and task counts to support the aggregate 40% claim and enable direct comparison.
- Include version numbers or access dates for the commercial agents (Devin, Claude Code, Codex) to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive recommendation of minor revision and for the constructive comment on dataset construction. We address the point below and will update the manuscript accordingly.
Point-by-point responses
- Referee: [Dataset Construction] The translation from human review comments to evaluation tests is central to the benchmark's reliability (see dataset construction description). The manuscript should specify the exact mapping rules or provide concrete examples of test generation to demonstrate that the tests faithfully capture review quality without introducing artifacts or coverage gaps.
  Authors: We agree that concrete examples are essential to demonstrate the fidelity of the test-generation process. In the revised manuscript we will add a new subsection (with accompanying examples) that walks through the mapping from specific human review comments to the corresponding evaluation tests. For instance, a human comment flagging a missing input validation will be shown to generate a test that checks whether an agent review identifies the missing check and suggests an appropriate fix. These examples will be drawn directly from the c-CRAB construction pipeline and will illustrate that the tests preserve the semantic intent of the original human comments without introducing extraneous artifacts or systematic coverage gaps. We believe this addition will fully address the concern while keeping the paper concise.
  Revision: yes
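To make the promised mapping concrete, a hypothetical generated test for the missing-input-validation example could look like the sketch below; the schema, file path, and identifier are invented for exposition and are not drawn from c-CRAB:

```python
# Human comment (invented example):
#   "validate `user_id` before the DB lookup in api/users.py"
def test_review_flags_missing_input_validation(agent_review: str) -> bool:
    """The agent review counts as covering the issue only if it (1) points
    at the right file and (2) names the missing validation of user_id."""
    text = agent_review.lower()
    mentions_location = "api/users.py" in agent_review
    mentions_issue = "validat" in text and "user_id" in text
    return mentions_location and mentions_issue
```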
Circularity Check
No significant circularity
Full rationale
This is an empirical benchmark paper whose central claims follow directly from dataset construction and agent evaluation. The c-CRAB tasks are generated from human PR reviews via systematic test creation; the ~40% solve rate and observed differences in review focus are measured outcomes on that held-out suite. No equations, parameter fitting, self-referential predictions, or load-bearing self-citations appear in the derivation chain. The evaluation logic is self-contained once the human-review-to-test mapping is granted, with no reduction of results to their own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human reviews of pull requests constitute a reliable gold standard for assessing code review quality.
invented entities (1)
- c-CRAB dataset: no independent evidence.
Forward citations
Cited by 1 Pith paper
- SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle. The SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
Reference graph
Works this paper leans on
- [1] Maria Antoniak and David Mimno. 2018. Evaluating the Stability of Embedding-based Word Similarities. Trans. Assoc. Comput. Linguistics 6 (2018), 107–119.
- [2]
- [3] Ahmed E. Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, and Dong Qiu. 2025. Agentic Software Engineering: Foundational Pillars and a Research Roadmap. CoRR abs/2509.06216 (2025).
- [4]
- [5]
- [6]
- [7] Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, and Scarlett Li. 2025. FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation. In ACL (1). Association for Computational Linguistics, 17160–17176.
- [8] Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel Sundaresan. 2022. Automating code review activities by large-scale pre-training. In ESEC/SIGSOFT FSE. ACM, 1035–1047.
- [9] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013/
- [10] Junyi Lu, Xiaojia Li, Zihan Hua, Lei Yu, Shiqi Cheng, Li Yang, Fengjun Zhang, and Chun Zuo. 2025. DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation. In FASE (Lecture Notes in Computer Science, Vol. 15693). Springer, 43–64.
- [11] Jai Lal Lulla, Seyedmoein Mohsenimofidi, Matthias Galster, Jie M. Zhang, Sebastian Baltes, and Christoph Treude. 2026. On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents. CoRR abs/2601.20404 (2026). arXiv:2601.20404. doi:10.48550/ARXIV.2601.20404.
- [12] Atharva Naik, Marcus Alenius, Daniel Fried, and Carolyn P. Rosé. 2025. CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells. In NAACL (Long Papers). Association for Computational Linguistics, 9049–9076.
- [13] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In ACL. ACL, 311–318.
- [14]
- [15]
- [16] Maja Popovic. 2015. chrF: character n-gram F-score for automatic MT evaluation. In WMT@EMNLP. The Association for Computer Linguistics, 392–395.
- [17]
- [18] Romain Robbes, Théo Matricon, Thomas Degueule, André C. Hora, and Stefano Zacchiroli. 2026. Agentic Much? Adoption of Coding Agents on GitHub. CoRR abs/2601.18341 (2026).
- [19] Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. 2025. Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge. In IJCNLP-AACL (long papers). The Asian Federation of Natural Language Processing and The Association for Computational Linguistics, 292–314.
- [20] Xunzhu Tang, Kisub Kim, Yewei Song, Cedric Lothritz, Bei Li, Saad Ezzini, Haoye Tian, Jacques Klein, and Tegawendé F. Bissyandé. 2024. CodeAgent: Autonomous Communicative Agents for Code Review. In EMNLP. Association for Computational Linguistics, 11279–11313.
- [21] Chakkrit Kla Tantithamthavorn, Yaotian Zou, Andy Wong, Michael Gupta, Zhe Wang, Mike Buller, Ryan Jiang, Matthew Watson, Minwoo Jeong, Kun Chen, and Ming Wu. 2026. RovoDev Code Reviewer: A Large-Scale Online Evaluation of LLM-based Code Review Automation at Atlassian. CoRR abs/2601.01129 (2026).
- [22] Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. 2024. Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges. CoRR abs/2406.12624 (2024). arXiv:2406.12624. doi:10.48550/ARXIV.2406.12624.
- [23] Rosalia Tufano, Simone Masiero, Antonio Mastropaolo, Luca Pascarella, Denys Poshyvanyk, and Gabriele Bavota. 2022. Using Pre-Trained Models to Boost Code Review Automation. In ICSE. ACM, 2291–2302.
- [24] Rosalia Tufano, Luca Pascarella, Michele Tufano, Denys Poshyvanyk, and Gabriele Bavota. 2021. Towards Automating Code Review Activities. In ICSE. IEEE, 163–174.
- [25] Chengxing Xie, Bowen Li, Chang Gao, He Du, Wai Lam, Difan Zou, and Kai Chen.
- [26] SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution. In ACL (Findings) (Findings of ACL, Vol. ACL 2025). Association for Computational Linguistics, 1123–1139.
- [27] Tomer Yanay and Bar Fingerman. 2026. How We Built a Real-World Benchmark for AI Code Review. Qodo blog post. https://www.qodo.ai/blog/how-we-built-a-real-world-benchmark-for-ai-code-review/
- [28]
- [29] Lei Zhang, Yongda Yu, Minghui Yu, Xinxin Guo, Zhengqi Zhuang, Guoping Rong, Dong Shao, Haifeng Shen, Hongyu Kuang, Zhengfeng Li, Boge Wang, Guoan Zhang, Bangyu Xiang, and Xiaobin Xu. 2026. AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context. arXiv:2601.19494 [cs.SE]. https://arxiv.org/abs/2601.19494
- [30] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. In ISSTA. ACM, 1592–1604.
- [31] Qixing Zhou, Jiacheng Zhang, Haiyang Wang, Rui Hao, Jiahe Wang, Minghao Han, Yuxue Yang, Shuzhe Wu, Feiyang Pan, Lue Fan, Dandan Tu, and Zhaoxiang Zhang. 2026. FeatureBench: Benchmarking Agentic Coding for Complex Feature Development. arXiv:2602.10975 [cs.SE]. https://arxiv.org/abs/2602.10975