Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions

Christoph Treude; Dong Shao; Guoping Rong; He Zhang; Hongyu Kuang; Kexin Sun; Sebastian Baltes; Xiaoxing Ma; Xin Zhou

arxiv: 2508.18771 · v2 · submitted 2025-08-26 · 💻 cs.SE

Pith reviewed 2026-05-18 21:46 UTC · model grok-4.3

classification 💻 cs.SE

keywords AI code review · GitHub Actions · pull requests · review comments · code changes · empirical study · tool effectiveness

The pith

Concise AI review comments with code snippets are more likely to prompt code changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether comments from AI-based code review tools on GitHub actually lead developers to modify their code. By examining over 22,000 comments across 178 repositories and 16 tools, the study identifies specific characteristics that increase the chance of changes occurring. Readers would care because these tools are increasingly used in software teams, and the results point to practical ways to make reviews more useful.

Core claim

In an empirical study of 16 AI code review actions, the authors determined that review comments are more likely to result in code changes when they are concise, contain code snippets, are manually triggered, and originate from hunk-level review tools. Adoption of these tools is growing, yet their effectiveness varies substantially depending on these factors.

What carries the argument

A two-stage LLM-assisted framework that classifies whether a review comment has been addressed by later code changes, enabling measurement of effectiveness and factor analysis via interpretable machine learning.

If this is right

Tool developers should focus on generating concise comments that include specific code examples.
Enabling manual triggering rather than fully automatic reviews increases the likelihood of changes.
Hunk-level analysis tools outperform broader review approaches in driving modifications.
Teams can improve outcomes by selecting and configuring tools based on these comment traits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams adopting these tools may achieve better results by prioritizing configurations that support manual triggers.
The findings could guide design choices for AI review features on platforms beyond GitHub.
Controlled experiments could test whether changing comment style alone increases address rates.

Load-bearing premise

The two-stage framework correctly identifies which review comments led to actual code changes without substantial misclassification errors.

What would settle it

A manual review of a random sample of 500 comments to check whether the framework's addressed-or-not labels match human judgments on whether developers made related changes.
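
The check described above can be sketched in a few lines. This is a minimal illustration rather than the paper's procedure: the function name `agreement_report` and the toy labels are hypothetical, and it assumes each sampled comment carries one binary label from the framework and one from a human annotator, with the human label treated as the gold standard.

```python
from collections import Counter


def agreement_report(framework_labels, human_labels):
    """Compare binary labels (True means 'addressed') from two sources.

    Human labels are treated as the gold standard for precision and recall.
    """
    if not framework_labels or len(framework_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(framework_labels)
    pairs = Counter(zip(framework_labels, human_labels))
    tp = pairs[(True, True)]    # both say addressed
    fp = pairs[(True, False)]   # framework says addressed, human disagrees
    fn = pairs[(False, True)]   # framework missed an addressed comment
    tn = pairs[(False, False)]  # both say not addressed

    observed = (tp + tn) / n
    # Agreement expected by chance, from each source's share of positives.
    p_framework = (tp + fp) / n
    p_human = (tp + fn) / n
    expected = p_framework * p_human + (1 - p_framework) * (1 - p_human)
    # Cohen's kappa is undefined when chance agreement is 1; report 1.0 then.
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0

    return {
        "accuracy": observed,
        "kappa": kappa,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy data only; a real check would use the sampled comments.
framework = [True, True, False, False, True, False, True, False]
human = [True, False, False, False, True, False, True, True]
report = agreement_report(framework, human)
print(report)
```

Reporting kappa alongside precision and recall matters here because raw accuracy can look high simply because most comments fall into one class.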

Figures

Figures reproduced from arXiv: 2508.18771 by Christoph Treude, Dong Shao, Guoping Rong, He Zhang, Hongyu Kuang, Kexin Sun, Sebastian Baltes, Xiaoxing Ma, Xin Zhou.

**Figure 1.** Example comments from a PR-level review action (Integral-Healthcare/robin-ai-reviewer), a file-level review action (anc95/ChatGPT-CodeReview), and a hunk-level review action (coderabbitai/ai-pr-reviewer).

**Figure 2.** An example of configuring an AI-based code review action.

Original abstract

AI-based code review tools automatically review and comment on pull requests to improve code quality. Despite their growing presence, little is known about their actual impact. We present a large-scale empirical study of 16 popular AI-based code review actions for GitHub workflows, analyzing more than 22,000 review comments in 178 repositories. We investigate (1) how these tools are adopted and configured, (2) whether their comments lead to code changes, and (3) which factors influence their effectiveness. We develop a two-stage LLM-assisted framework to determine whether review comments are addressed, and use interpretable machine learning to identify influencing factors. Our findings show that, while adoption is growing, effectiveness varies widely. Comments that are concise, contain code snippets, and are manually triggered, particularly those from hunk-level review tools, are more likely to result in code changes. These results highlight the importance of careful tool design and suggest directions for improving AI-based code review systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a large-scale empirical study of 16 AI-based code review GitHub Actions across 178 repositories and more than 22,000 review comments. It examines tool adoption and configuration, develops a two-stage LLM-assisted framework to label whether each comment was addressed by later commits in the same PR, and applies interpretable machine learning to identify factors (conciseness, presence of code snippets, manual triggering, and hunk-level granularity) that correlate with subsequent code changes. The central claim is that these specific comment and tool characteristics increase the likelihood of code changes.

Significance. If the two-stage LLM labeling step proves reliable, the study supplies timely, actionable evidence on the real-world effectiveness of AI code review tools at GitHub scale. The combination of public data, interpretable ML for factor identification, and concrete design recommendations (favoring concise, snippet-rich, manually triggered hunk-level reviews) would be a useful addition to the empirical software engineering literature on automated review.

major comments (2)

[§4] §4 (two-stage LLM framework): No validation results—human inter-rater agreement, precision/recall on a held-out labeled sample, or sensitivity analysis—are reported for the classifier that produces the binary 'addressed' label. Because this label is the dependent variable for all effectiveness statistics and the subsequent interpretable-ML feature importances, any systematic misclassification (e.g., the LLM more readily labeling concise or snippet-containing comments as addressed) would directly distort the headline findings.
[§5] §5 (results on hunk-level tools): The reported advantage for hunk-level review tools rests on accurate detection of code changes after each comment. The manuscript should explicitly describe the diff-based change detection logic and any steps taken to exclude unrelated changes within the same PR; without this, the granularity comparison remains vulnerable to measurement error.

minor comments (2)

[§3] A summary table listing the 16 tools, their review granularity (PR/file/hunk), and default configuration options would improve readability and allow readers to map the reported effectiveness differences back to tool characteristics.
[Figure 4] The figures showing effectiveness rates by comment length or trigger type would benefit from confidence intervals or bootstrap error bars to convey statistical uncertainty.
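
One way to produce the error bars suggested above is a percentile bootstrap over the per-comment outcomes in each group. The sketch below is illustrative only: the function name `bootstrap_rate_ci`, the resample count, and the toy data are assumptions, and nothing here reflects how the paper's figures were actually generated.

```python
import random


def bootstrap_rate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the share of addressed comments.

    `outcomes` is a list of 0/1 values, one per comment (1 = addressed).
    Returns (observed_rate, lower_bound, upper_bound).
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample comments with replacement and record each resample's rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int(alpha / 2 * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(outcomes) / n, lower, upper


# Toy group: 30 addressed comments out of 100 (made-up numbers).
toy_group = [1] * 30 + [0] * 70
rate, low, high = bootstrap_rate_ci(toy_group)
print(f"address rate {rate:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

Computing one interval per bar (for example per trigger type or per length bucket) would let readers judge whether the gaps between groups exceed sampling noise.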

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of methodological transparency that we have addressed in the revision. Below we respond to each major comment.

Point-by-point responses

Referee: [§4] §4 (two-stage LLM framework): No validation results—human inter-rater agreement, precision/recall on a held-out labeled sample, or sensitivity analysis—are reported for the classifier that produces the binary 'addressed' label. Because this label is the dependent variable for all effectiveness statistics and the subsequent interpretable-ML feature importances, any systematic misclassification (e.g., the LLM more readily labeling concise or snippet-containing comments as addressed) would directly distort the headline findings.

Authors: We acknowledge that the original manuscript did not report quantitative validation metrics for the two-stage LLM-assisted labeling framework. While the framework combines LLM classification with rule-based post-processing to improve reliability, we agree that explicit validation is essential given the label's central role. In the revised manuscript we have added a new subsection in §4 that reports: (1) inter-rater agreement (Cohen's κ) between two human annotators on a stratified sample of 400 comments, (2) precision and recall of the LLM labels against the human gold standard, and (3) a sensitivity analysis varying prompt phrasing and temperature. We also discuss potential biases and how the two-stage design mitigates them. These additions directly address the concern about systematic misclassification.

revision: yes

Referee: [§5] §5 (results on hunk-level tools): The reported advantage for hunk-level review tools rests on accurate detection of code changes after each comment. The manuscript should explicitly describe the diff-based change detection logic and any steps taken to exclude unrelated changes within the same PR; without this, the granularity comparison remains vulnerable to measurement error.

Authors: We agree that a precise description of the change-detection procedure is required for reproducibility and to rule out measurement error in the hunk-level comparison. In the revised §5 we have inserted a dedicated paragraph that details: (i) extraction of file path and line-range information from each review comment, (ii) use of git diff to identify overlapping modifications in subsequent commits of the same PR, and (iii) explicit filtering steps that discard changes occurring in unrelated files or in commits that address other review threads. We believe this clarification strengthens the validity of the granularity findings.

revision: yes
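
Steps (i) and (ii) of this response, plus the file-level part of (iii), can be illustrated roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: it assumes the diff was produced with zero context lines (for example `git diff -U0`), so that hunk ranges cover only changed lines, and that the comment's line range refers to the pre-change version of the file. The names `old_side_hunks`, `comment_touched`, and `SAMPLE_DIFF` are hypothetical, and filtering out commits that address other review threads is not covered.

```python
import re

# Unified-diff hunk header, e.g. "@@ -10,2 +10,3 @@" (lengths default to 1).
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")
# Pre-change file header, e.g. "--- a/src/app.py".
# Simplification: a removed line whose text starts with "-- " would also match.
FILE_HEADER = re.compile(r"^--- (?:a/)?(.+)$")


def old_side_hunks(diff_text):
    """Yield (path, first_line, last_line) for the pre-change side of each hunk."""
    path = None
    for line in diff_text.splitlines():
        file_match = FILE_HEADER.match(line)
        if file_match:
            path = file_match.group(1)
            continue
        hunk_match = HUNK_HEADER.match(line)
        if hunk_match and path is not None:
            start = int(hunk_match.group(1))
            length = 1 if hunk_match.group(2) is None else int(hunk_match.group(2))
            # Length 0 is a pure insertion after line `start`; treat it as
            # touching that single line.
            end = start if length == 0 else start + length - 1
            yield path, start, end


def comment_touched(diff_text, path, first_line, last_line):
    """True if any hunk in `path` overlaps the comment's line range."""
    return any(
        hunk_path == path and start <= last_line and end >= first_line
        for hunk_path, start, end in old_side_hunks(diff_text)
    )


# Made-up zero-context diff for illustration.
SAMPLE_DIFF = """\
diff --git a/src/app.py b/src/app.py
--- a/src/app.py
+++ b/src/app.py
@@ -10,2 +10,3 @@
-old line 10
-old line 11
+new line 10
+new line 11
+new line 12
@@ -40,0 +42 @@
+inserted line
"""

print(comment_touched(SAMPLE_DIFF, "src/app.py", 11, 15))
```

An overlap only shows that nearby code changed; deciding whether the change actually responds to the comment still needs a semantic judgment, which is where a labeling step such as the paper's LLM-assisted framework comes in.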

Circularity Check

0 steps flagged

No significant circularity: empirical measurement chain remains independent of its outputs

Full rationale

The paper conducts a direct empirical analysis on public GitHub PR data: it collects >22k review comments, applies a separately developed two-stage LLM framework to label whether each comment was addressed by later commits in the same PR, then applies interpretable ML to surface associations with comment features such as conciseness and presence of code snippets. No derivation step reduces the final claims to the inputs by construction, no parameter is fitted on a subset and then relabeled as a prediction, and no load-bearing premise rests on a self-citation whose content is itself unverified. The classification step, while subject to potential error, is an external measurement operation whose correctness is not presupposed by the factor analysis; the reported associations are therefore falsifiable against the raw commit history rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central effectiveness claims rest on the accuracy of the LLM-based change detection method and the assumption that the sampled repositories and tools are representative of broader AI code review usage.

axioms (1)

Domain assumption: The two-stage LLM-assisted framework can reliably determine whether a review comment leads to a code change.
This classification step is required to measure the outcome variable but its error rate is not quantified in the abstract.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Heimdallr: Characterizing and Detecting LLM-Induced Security Risks in GitHub CI Workflows
cs.CR · 2026-05 · unverdicted · novelty 8.0

Heimdallr detects LLM-induced security risks in GitHub CI workflows by normalizing them into an LLM-Workflow Property Graph and combining triggerability analysis with LLM-assisted dataflow summarization, achieving ove...

Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study
cs.SE · 2026-05 · unverdicted · novelty 6.0

Analysis of 9,799 human-reviewed agentic PRs shows only 35.7% of rejections reflect clear agent failures, with 31.2% due to workflow constraints and 33.1% lacking clear rationale, plus notable interaction differences ...

On the Footprints of Reviewer Bots Feedback on Agentic Pull Requests in OSS GitHub Repositories
cs.SE · 2026-04 · unverdicted · novelty 6.0

Reviewer bots' higher comment volume on AI agent PRs is associated with slower resolutions and poorer average feedback quality, while feedback quality itself has no association with PR outcomes.

From Industry Claims to Empirical Reality: An Empirical Study of Code Review Agents in Pull Requests
cs.SE · 2026-04 · conditional · novelty 5.0

Code review agents achieve 45.20% merge rate on PRs versus 68.37% for humans, with 60.2% of agent-only closed PRs showing 0-30% signal quality.

Enhancing Large Language Models with Retrieval Augmented Generation for Software Testing and Inspection Automation
cs.SE · 2026-04 · unverdicted · novelty 3.0

RAG-enhanced LLMs show generally positive effects on automated test generation and code inspection by supplying supplementary context that reduces hallucinations.


[1] [1]

Modern code review: a case study at google,

C. Sadowski, E. S ¨oderberg, L. Church, M. Sipko, and A. Bacchelli, “Modern code review: a case study at google,” inProceedings of the 40th international conference on software engineering: Software engineering in practice, 2018, pp. 181–190

work page 2018

[2] [2]

Work practices and challenges in pull-based development: The contributor’s perspective,

G. Gousios, M.-A. Storey, and A. Bacchelli, “Work practices and challenges in pull-based development: The contributor’s perspective,” in Proceedings of the 38th international conference on software engi- neering, 2016, pp. 285–296

work page 2016

[3] [3]

Llama-reviewer: Advancing code review automation with large language models through parameter- efficient fine-tuning,

J. Lu, L. Yu, X. Li, L. Yang, and C. Zuo, “Llama-reviewer: Advancing code review automation with large language models through parameter- efficient fine-tuning,” in 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE) . IEEE, 2023, pp. 647–658

work page 2023

[4] [4]

Github actions: the impact on the pull request process,

M. Wessel, J. Vargovich, M. A. Gerosa, and C. Treude, “Github actions: the impact on the pull request process,” Empirical Software Engineering, vol. 28, no. 6, p. 131, 2023

work page 2023

[5] [5]

On the use of github actions in software development repositories,

A. Decan, T. Mens, P. R. Mazrae, and M. Golzadeh, “On the use of github actions in software development repositories,” in 2022 IEEE International Conference on Software Maintenance and Evolution (IC- SME). IEEE, 2022, pp. 235–245

work page 2022

[6] [6]

Expectations, outcomes, and challenges of modern code review,

A. Bacchelli and C. Bird, “Expectations, outcomes, and challenges of modern code review,” in2013 35th International Conference on Software Engineering (ICSE). IEEE, 2013, pp. 712–721

work page 2013

[7] [7]

Characteristics of useful code reviews: An empirical study at microsoft,

A. Bosu, M. Greiler, and C. Bird, “Characteristics of useful code reviews: An empirical study at microsoft,” in 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories . IEEE, 2015, pp. 146–156

work page 2015

[8] A. K. Turzo and A. Bosu, “What makes a code review useful to opendev developers? an empirical investigation,” Empirical Software Engineering, vol. 29, no. 1, p. 6, 2024

[9] M. M. Rahman, C. K. Roy, and R. G. Kula, “Predicting usefulness of code review comments using textual features and developer experience,” in 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). IEEE, 2017, pp. 215–226

[10] H. Y. Lin, P. Thongtanunam, C. Treude, M. W. Godfrey, C. Liu, and W. Charoenwet, “Leveraging reviewer experience in code review comment generation,” arXiv preprint arXiv:2409.10959, 2024

[11] L. MacLeod, M. Greiler, M.-A. Storey, C. Bird, and J. Czerwonka, “Code reviewing in the trenches: Challenges and best practices,” IEEE Software, vol. 35, no. 4, pp. 34–42, 2017

[12] “GitHub - brinnarlyne8585/AIReviewActionAnalysis — github.com,” https://github.com/brinnarlyne8585/AIReviewActionAnalysis, 2025, to be published on a preserved archive after acceptance, accessed 30-05-2025

[13] S. Wagner, M. M. Barón, D. Falessi, and S. Baltes, “Towards evaluation guidelines for empirical studies involving llms,” in 2nd International Workshop on Methodological Issues with Empirical Studies in Software Engineering (WSESE 2025), 2025

[14] S. Baltes, F. Angermeir, C. Arora, M. M. Barón, C. Chen, L. Böhme, F. Calefato, N. Ernst, D. Falessi, B. Fitzgerald, D. Fucci, M. Kalinowski, S. Lambiase, D. Russo, M. Lungu, L. Prechelt, P. Ralph, C. Treude, and S. Wagner, “Evaluation guidelines for empirical studies in software engineering involving llms,” 2025. [Online]. Available: https://arxiv.or...

[15] “Definition of “hunk” in the GNU diffutils manual,” https://www.gnu.org/software/diffutils/manual/html_node/Hunks.html, 2025

[16] “Openai,” https://chat.openai.com, 2025

[17] “Gemini,” https://gemini.google.com/, 2025

[18] “Claude,” https://claude.ai, 2025

[19] P. Reich and W. Maalej, “Testability refactoring in pull requests: Patterns and trends,” in 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 2023, pp. 1508–1519

[21] M. Golzadeh, D. Legay, A. Decan, and T. Mens, “Bot or not? detecting bots in github pull request activity based on comment similarity,” in Proceedings of the IEEE/ACM 42nd international conference on software engineering workshops, 2020, pp. 31–35

[22] “Deepseek,” https://deepseek.com, 2025

[23] S. M. Lundberg and S. Lee, “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 4765–4774

[24] T. Dey and A. Mockus, “Effect of technical and social factors on pull request quality for the NPM ecosystem,” in ESEM ’20: ACM / IEEE International Symposium on Empirical Software Engineering and Measurement, Bari, Italy, October 5-7, 2020. ACM, 2020, pp. 11:1–11:11. [Online]. Available: https://doi.org/10.1145/3382494.3410685

[25] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003

[26] C. Treude and M. Wagner, “Predicting good configurations for github and stack overflow topic models,” in Proceedings of the 16th International Conference on Mining Software Repositories, MSR 2019, 26-27 May 2019, Montreal, Canada. IEEE / ACM, 2019, pp. 84–95

[27] J. Jiarpakdee, C. Tantithamthavorn, and C. Treude, “Autospearman: Automatically mitigating correlated software metrics for interpreting defect models,” in 2018 IEEE International Conference on Software Maintenance and Evolution, ICSME 2018, Madrid, Spain, September 23-29, 2018. IEEE Computer Society, 2018, pp. 92–103

[28] U. Cihan, V. Haratian, A. İçöz, M. K. Gül, Ömercan Devran, E. F. Bayendur, B. M. Uçar, and E. Tüzün, “Automated code review in practice,” 2024. [Online]. Available: https://arxiv.org/abs/2412.18531

[29] Y. Wang, “Language matters,” in 2015 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2015, pp. 1–10

[30] M. Fu and C. Tantithamthavorn, “Linevul: A transformer-based line-level vulnerability prediction,” in Proceedings of the 19th International Conference on Mining Software Repositories, 2022, pp. 608–620

[31] M. Fu, C. Tantithamthavorn, T. Le, V. Nguyen, and D. Phung, “Vulrepair: a t5-based automated software vulnerability repair,” in Proceedings of the 30th ACM joint european software engineering conference and symposium on the foundations of software engineering, 2022, pp. 935–947

[32] X. Zhou, K. Kim, B. Xu, D. Han, and D. Lo, “Out of sight, out of mind: Better automatic vulnerability repair by broadening input ranges and sources,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 2024, pp. 1–13

[33] M. Fu, V. Nguyen, C. Tantithamthavorn, D. Phung, and T. Le, “Vision transformer inspired automated vulnerability repair,” ACM Transactions on Software Engineering and Methodology, vol. 33, no. 3, pp. 1–29, 2024

[34] M. Tufano, C. Watson, G. Bavota, M. D. Penta, M. White, and D. Poshyvanyk, “An empirical study on learning bug-fixing patches in the wild via neural machine translation,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 28, no. 4, pp. 1–29, 2019

[35] N. Jiang, T. Lutellier, and L. Tan, “Cure: Code-aware neural machine translation for automatic program repair,” in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 2021, pp. 1161–1173

[36] M. Jin, S. Shahriar, M. Tufano, X. Shi, S. Lu, N. Sundaresan, and A. Svyatkovskiy, “Inferfix: End-to-end program repair with llms,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 1646–1656

[37] B. Ray, V. Hellendoorn, S. Godhane, Z. Tu, A. Bacchelli, and P. Devanbu, “On the “naturalness” of buggy code,” in Proceedings of the 38th International Conference on Software Engineering, 2016, pp. 428–439

[38] A. Hindle, E. T. Barr, M. Gabel, Z. Su, and P. Devanbu, “On the naturalness of software,” Communications of the ACM, vol. 59, no. 5, pp. 122–131, 2016

[39] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton, “A survey of machine learning for big code and naturalness,” ACM Computing Surveys (CSUR), vol. 51, no. 4, pp. 1–37, 2018

[40] H. Y. Lin, C. Liu, H. Gao, P. Thongtanunam, and C. Treude, “The code review comprehension assessment for large language models,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025

[41] Z. Li, S. Lu, D. Guo, N. Duan, S. Jannu, G. Jenks, D. Majumder, J. Green, A. Svyatkovskiy, S. Fu et al., “Automating code review activities by large-scale pre-training,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 1035–1047

[42] V. J. Hellendoorn, J. Tsay, M. Mukherjee, and M. Hirzel, “Towards automating code review at scale,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 1479–1482

[43] R. Tufano, S. Masiero, A. Mastropaolo, L. Pascarella, D. Poshyvanyk, and G. Bavota, “Using pre-trained models to boost code review automation,” in Proceedings of the 44th international conference on software engineering, 2022, pp. 2291–2302

[44] L. Li, L. Yang, H. Jiang, J. Yan, T. Luo, Z. Hua, G. Liang, and C. Zuo, “Auger: automatically generating review comments with pre-training models,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 1009–1021

[45] H. Y. Lin, P. Thongtanunam, C. Treude, and W. Charoenwet, “Improving automated code reviews: Learning from experience,” in Proceedings of the 21st International Conference on Mining Software Repositories, 2024, pp. 278–283

[46] H. Y. Lin and P. Thongtanunam, “Towards automated code reviews: Does learning code structure help?” in 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2023, pp. 703–707

[47] P. Thongtanunam, C. Pornprasit, and C. Tantithamthavorn, “Autotransform: Automated code transformation to support modern code review process,” in Proceedings of the 44th international conference on software engineering, 2022, pp. 237–248

[48] O. Ben Sghaier and H. Sahraoui, “Improving the learning of code review successive tasks with cross-task knowledge distillation,” Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1086–1106, 2024

[49] B. Lin, S. Wang, Z. Liu, Y. Liu, X. Xia, and X. Mao, “Cct5: A code-change-oriented pre-trained model,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 1509–1521

[50] P. Vaithilingam, T. Zhang, and E. L. Glassman, “Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models,” in Chi conference on human factors in computing systems extended abstracts, 2022, pp. 1–7

[51] Y. Huang, H. Guo, X. Ding, J. Shu, X. Chen, X. Luo, Z. Zheng, and X. Zhou, “A comparative study on method comment and inline comment,” ACM Trans. Softw. Eng. Methodol., vol. 32, no. 5, pp. 126:1–126:26, 2023. [Online]. Available: https://doi.org/10.1145/3582570