Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions
Pith reviewed 2026-05-18 21:46 UTC · model grok-4.3
The pith
Concise AI review comments with code snippets are more likely to prompt code changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In an empirical study of 16 AI code review actions, the authors determined that review comments are more likely to result in code changes when they are concise, contain code snippets, are manually triggered, and originate from hunk-level review tools. Adoption of these tools is growing, yet their effectiveness varies substantially depending on these factors.
What carries the argument
A two-stage LLM-assisted framework that classifies whether a review comment has been addressed by later code changes, enabling measurement of effectiveness and factor analysis via interpretable machine learning.
If this is right
- Tool developers should focus on generating concise comments that include specific code examples.
- Enabling manual triggering rather than fully automatic reviews increases the likelihood of changes.
- Hunk-level analysis tools outperform broader review approaches in driving modifications.
- Teams can improve outcomes by selecting and configuring tools based on these comment traits.
Where Pith is reading between the lines
- Teams adopting these tools may achieve better results by prioritizing configurations that support manual triggers.
- The findings could guide design choices for AI review features on platforms beyond GitHub.
- Controlled experiments could test whether changing comment style alone increases address rates.
Load-bearing premise
The two-stage framework correctly identifies which review comments led to actual code changes without substantial misclassification errors.
What would settle it
A manual review of a random sample of 500 comments to check whether the framework's addressed-or-not labels match human judgments on whether developers made related changes.
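A minimal sketch of how such a check could be scored, assuming binary addressed/not-addressed labels and a simple random sample. The function names and data layout are hypothetical and do not come from the paper. It draws a reproducible audit sample, then compares framework labels with human labels using Cohen's kappa (chance-corrected agreement) and precision/recall.

```python
import random


def draw_audit_sample(comment_ids, k=500, seed=0):
    """Return a reproducible simple random sample of up to k comment IDs."""
    ids = sorted(comment_ids)  # fixed order, so the seed alone determines the sample
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of binary labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(bool(a) == bool(b) for a, b in zip(labels_a, labels_b)) / n
    rate_a = sum(bool(a) for a in labels_a) / n
    rate_b = sum(bool(b) for b in labels_b) / n
    # Agreement expected by chance if both raters labeled independently.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch reports it as perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def precision_recall(predicted, gold):
    """Precision and recall of predicted 'addressed' labels against gold labels."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Raw agreement alone can look high when most comments fall in one class, which is why the sketch reports kappa alongside precision and recall.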
Original abstract
AI-based code review tools automatically review and comment on pull requests to improve code quality. Despite their growing presence, little is known about their actual impact. We present a large-scale empirical study of 16 popular AI-based code review actions for GitHub workflows, analyzing more than 22,000 review comments in 178 repositories. We investigate (1) how these tools are adopted and configured, (2) whether their comments lead to code changes, and (3) which factors influence their effectiveness. We develop a two-stage LLM-assisted framework to determine whether review comments are addressed, and use interpretable machine learning to identify influencing factors. Our findings show that, while adoption is growing, effectiveness varies widely. Comments that are concise, contain code snippets, and are manually triggered, particularly those from hunk-level review tools, are more likely to result in code changes. These results highlight the importance of careful tool design and suggest directions for improving AI-based code review systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a large-scale empirical study of 16 AI-based code review GitHub Actions across 178 repositories and more than 22,000 review comments. It examines tool adoption and configuration, develops a two-stage LLM-assisted framework to label whether each comment was addressed by later commits in the same PR, and applies interpretable machine learning to identify factors (conciseness, presence of code snippets, manual triggering, and hunk-level granularity) that correlate with subsequent code changes. The central claim is that these specific comment and tool characteristics increase the likelihood of code changes.
Significance. If the two-stage LLM labeling step proves reliable, the study supplies timely, actionable evidence on the real-world effectiveness of AI code review tools at GitHub scale. The combination of public data, interpretable ML for factor identification, and concrete design recommendations (favoring concise, snippet-rich, manually triggered hunk-level reviews) would be a useful addition to the empirical software engineering literature on automated review.
major comments (2)
- [§4] §4 (two-stage LLM framework): No validation results—human inter-rater agreement, precision/recall on a held-out labeled sample, or sensitivity analysis—are reported for the classifier that produces the binary 'addressed' label. Because this label is the dependent variable for all effectiveness statistics and the subsequent interpretable-ML feature importances, any systematic misclassification (e.g., the LLM more readily labeling concise or snippet-containing comments as addressed) would directly distort the headline findings.
- [§5] §5 (results on hunk-level tools): The reported advantage for hunk-level review tools rests on accurate detection of code changes after each comment. The manuscript should explicitly describe the diff-based change detection logic and any steps taken to exclude unrelated changes within the same PR; without this, the granularity comparison remains vulnerable to measurement error.
minor comments (2)
- [§3] A summary table listing the 16 tools, their review granularity (file/hunk/line), and default configuration options would improve readability and allow readers to map the reported effectiveness differences back to tool characteristics.
- [Figure 4] The figures showing effectiveness rates by comment length or trigger type would benefit from confidence intervals or bootstrap error bars to convey statistical uncertainty.
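As an illustration of the kind of uncertainty estimate the referee asks for, here is a minimal sketch of a percentile bootstrap interval for an address rate. It assumes each comment contributes one binary outcome (1 = addressed); the function name and defaults are hypothetical and are not taken from the paper.

```python
import random


def bootstrap_rate_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of binary outcomes."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement n_boot times and record each resample's rate.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lower = rates[int((alpha / 2) * n_boot)]
    upper = rates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Calling `bootstrap_rate_ci` once per group (for example, per trigger type or per comment-length bin) would give the error bars for each plotted rate. Resampling individual comments treats them as independent, which may understate uncertainty if comments cluster by repository or pull request.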
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of methodological transparency that we have addressed in the revision. Below we respond to each major comment.
-
Referee: [§4] §4 (two-stage LLM framework): No validation results—human inter-rater agreement, precision/recall on a held-out labeled sample, or sensitivity analysis—are reported for the classifier that produces the binary 'addressed' label. Because this label is the dependent variable for all effectiveness statistics and the subsequent interpretable-ML feature importances, any systematic misclassification (e.g., the LLM more readily labeling concise or snippet-containing comments as addressed) would directly distort the headline findings.
Authors: We acknowledge that the original manuscript did not report quantitative validation metrics for the two-stage LLM-assisted labeling framework. While the framework combines LLM classification with rule-based post-processing to improve reliability, we agree that explicit validation is essential given the label's central role. In the revised manuscript we have added a new subsection in §4 that reports: (1) inter-rater agreement (Cohen's κ) between two human annotators on a stratified sample of 400 comments, (2) precision and recall of the LLM labels against the human gold standard, and (3) a sensitivity analysis varying prompt phrasing and temperature. We also discuss potential biases and how the two-stage design mitigates them. These additions directly address the concern about systematic misclassification.
Revision: yes
-
Referee: [§5] §5 (results on hunk-level tools): The reported advantage for hunk-level review tools rests on accurate detection of code changes after each comment. The manuscript should explicitly describe the diff-based change detection logic and any steps taken to exclude unrelated changes within the same PR; without this, the granularity comparison remains vulnerable to measurement error.
Authors: We agree that a precise description of the change-detection procedure is required for reproducibility and to rule out measurement error in the hunk-level comparison. In the revised §5 we have inserted a dedicated paragraph that details: (i) extraction of file path and line-range information from each review comment, (ii) use of git diff to identify overlapping modifications in subsequent commits of the same PR, and (iii) explicit filtering steps that discard changes occurring in unrelated files or in commits that address other review threads. We believe this clarification strengthens the validity of the granularity findings.
Revision: yes
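Steps (i) and (ii) of the response can be illustrated with a short sketch. It is not the authors' implementation: the helper names are hypothetical, and it assumes the later commit's changes are available as unified diff text. It reads the old-side line ranges from each hunk header and tests whether they intersect the line range a review comment points at.

```python
import re

# Hunk header of a unified diff, e.g. "@@ -10,3 +10,4 @@"; a missing count means 1.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+\d+(?:,\d+)? @@")


def old_side_ranges(unified_diff):
    """Map each pre-change file path to the (start, end) line ranges its hunks touch."""
    ranges = {}
    path = None
    lines = unified_diff.splitlines()
    for i, line in enumerate(lines):
        # Treat "--- " as a file header only when a "+++ " line follows it.
        if line.startswith("--- ") and i + 1 < len(lines) and lines[i + 1].startswith("+++ "):
            target = line[4:].strip()
            if target == "/dev/null":
                path = None  # newly added file: no pre-change lines to match
            else:
                path = target[2:] if target.startswith("a/") else target
            continue
        match = HUNK_HEADER.match(line)
        if match and path is not None:
            start = int(match.group(1))
            count = int(match.group(2)) if match.group(2) is not None else 1
            # A count of 0 is a pure insertion; treat it as touching line `start`.
            end = start + max(count, 1) - 1
            ranges.setdefault(path, []).append((start, end))
    return ranges


def comment_overlaps_change(path, start, end, ranges):
    """True if lines start..end of `path` intersect any changed old-side range."""
    return any(s <= end and start <= e for s, e in ranges.get(path, []))
```

Hunks normally include unchanged context lines, so this test is approximate unless the diff is produced with zero context lines (for example `git diff -U0`). The filtering in step (iii), which separates changes made for other review threads, is not covered by this sketch.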
Circularity Check
No significant circularity: empirical measurement chain remains independent of its outputs
The paper conducts a direct empirical analysis on public GitHub PR data: it collects >22k review comments, applies a separately developed two-stage LLM framework to label whether each comment was addressed by later commits in the same PR, then applies interpretable ML to surface associations with comment features such as conciseness and presence of code snippets. No derivation step reduces the final claims to the inputs by construction, no parameter is fitted on a subset and then relabeled as a prediction, and no load-bearing premise rests on a self-citation whose content is itself unverified. The classification step, while subject to potential error, is an external measurement operation whose correctness is not presupposed by the factor analysis; the reported associations are therefore falsifiable against the raw commit history rather than tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The two-stage LLM-assisted framework can reliably determine whether a review comment leads to a code change.
Forward citations
Cited by 5 Pith papers
- Heimdallr: Characterizing and Detecting LLM-Induced Security Risks in GitHub CI Workflows
  Heimdallr detects LLM-induced security risks in GitHub CI workflows by normalizing them into an LLM-Workflow Property Graph and combining triggerability analysis with LLM-assisted dataflow summarization, achieving ove...
- Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study
  Analysis of 9,799 human-reviewed agentic PRs shows only 35.7% of rejections reflect clear agent failures, with 31.2% due to workflow constraints and 33.1% lacking clear rationale, plus notable interaction differences ...
- On the Footprints of Reviewer Bots Feedback on Agentic Pull Requests in OSS GitHub Repositories
  Reviewer bots' higher comment volume on AI agent PRs is associated with slower resolutions and poorer average feedback quality, while feedback quality itself has no association with PR outcomes.
- From Industry Claims to Empirical Reality: An Empirical Study of Code Review Agents in Pull Requests
  Code review agents achieve 45.20% merge rate on PRs versus 68.37% for humans, with 60.2% of agent-only closed PRs showing 0-30% signal quality.
- Enhancing Large Language Models with Retrieval Augmented Generation for Software Testing and Inspection Automation
  RAG-enhanced LLMs show generally positive effects on automated test generation and code inspection by supplying supplementary context that reduces hallucinations.