Habituation at the Gate: Rising Approval and Declining Scrutiny in Human Review of AI Agent Code

Haoran Yu; Lifei Liu; Pin Qian; Su Wang; Xiaochong Jiang; Yihang Chen; Yuwen Jia

arxiv: 2606.22721 · v1 · pith:VFURDW22new · submitted 2026-06-21 · 💻 cs.SE

Pith reviewed 2026-06-26 09:29 UTC · model grok-4.3

classification 💻 cs.SE

keywords AI coding agents, code review, habituation, pull requests, human-AI collaboration, approval rates, open source, reviewer experience

The pith

Reviewers approve more AI-generated code over time but leave fewer comments and take longer to respond, a pattern consistent with habituation under workload.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether human reviewers gradually reduce scrutiny of AI agent pull requests as they gain experience. It tracks 400 repeat reviewers across 11,429 reviews in the AIDev dataset over seven months and finds approval rates rise from 30.1% early to 36.8% late, a shift that survives controls for calendar time and is absent for human-submitted PRs. Comment volume falls 22% while review latency rises 3.5 times, even though PR sizes remain flat. The authors conclude the pattern fits reflexive habituation under growing workload better than rational trust calibration.

Core claim

Approval rates for AI agent code increase with within-reviewer experience, reaching a +14.5 pp cumulative gap from first to tenth decile, while inline comments decline and queue times lengthen; the combination points to habituation rather than calibrated trust.
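The two headline numbers measure different things, and a short arithmetic sketch keeps them apart. The paired early-to-late shift (30.1% to 36.8%) is 6.7 percentage points, or roughly a 22% relative increase, while the +14.5 pp figure is a separate gap pooled by experience decile. The helper name `describe_shift` is illustrative and does not come from the paper.

```python
def describe_shift(early_pct, late_pct):
    """Return (percentage-point change, relative change in percent)."""
    pp_change = late_pct - early_pct
    relative_change = 100.0 * pp_change / early_pct
    return pp_change, relative_change


# Early and late approval rates as reported in the abstract.
pp_change, relative_change = describe_shift(30.1, 36.8)
print(f"{pp_change:.1f} pp, {relative_change:.1f}% relative")
```

Reporting both forms avoids reading "+6.7 pp" as "+6.7%", which would understate the relative change.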

What carries the argument

Within-reviewer longitudinal comparison of early versus late review episodes for the same individuals, pooled by experience decile and controlled for calendar time, agent type, and PR size.
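A within-reviewer early-versus-late comparison of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the paper does not say how "early" and "late" are cut, so the sketch assumes the first and last halves of each reviewer's chronologically ordered reviews. The function name, tuple layout, and toy data are all hypothetical.

```python
from collections import defaultdict


def early_late_shifts(reviews):
    """Per-reviewer change in approval rate, late half minus early half.

    reviews: iterable of (reviewer_id, timestamp, approved) tuples.
    Assumption: "early" is the first half of a reviewer's reviews and
    "late" is the last half; with an odd count the middle review is dropped.
    """
    by_reviewer = defaultdict(list)
    for reviewer_id, timestamp, approved in reviews:
        by_reviewer[reviewer_id].append((timestamp, approved))

    shifts = {}
    for reviewer_id, episodes in by_reviewer.items():
        if len(episodes) < 2:
            continue  # a single review cannot be split into early and late
        episodes.sort(key=lambda episode: episode[0])
        half = len(episodes) // 2
        early = [approved for _, approved in episodes[:half]]
        late = [approved for _, approved in episodes[len(episodes) - half:]]
        shifts[reviewer_id] = sum(late) / len(late) - sum(early) / len(early)
    return shifts


# Toy data, invented for illustration only.
toy_reviews = [
    ("r1", 1, False), ("r1", 2, False), ("r1", 3, True), ("r1", 4, True),
    ("r2", 1, True), ("r2", 2, False), ("r2", 3, True), ("r2", 4, True),
    ("r3", 1, False),
]
shifts = early_late_shifts(toy_reviews)
```

Because each reviewer is compared only with their own earlier behavior, stable differences between strict and lenient reviewers cancel out of the shift.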

If this is right

Approval rates for AI PRs rise with reviewer experience while rates for human PRs fall over the same period.
Inline comment volume decreases 22% even as median review latency increases 3.5 times.
The approval increase persists after controlling for calendar time and is not explained by changes in PR size.
The observed pattern is interpreted as more consistent with reflexive habituation than with rational trust calibration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Habituation could allow more low-quality AI code to enter open-source repositories if workload continues to grow.
Platforms might test interventions such as periodic reviewer rotation or explicit scrutiny prompts to counteract the effect.
Similar experience-driven drops in scrutiny may appear in other repeated human review tasks involving AI outputs, such as content moderation.

Load-bearing premise

The within-reviewer design and controls for calendar time, agent type, and PR size fully isolate experience-driven habituation from unmeasured shifts in reviewer behavior or PR characteristics.

What would settle it

A dataset tracking the same reviewers on AI PRs where approval rates stay flat or decline and comment volume remains constant despite rising workload.

Figures

Figures reproduced from arXiv: 2606.22721 by Haoran Yu, Lifei Liu, Pin Qian, Su Wang, Xiaochong Jiang, Yihang Chen, Yuwen Jia.

**Figure 2.** Monthly approval rate for agent PRs (solid) vs. hu...

Original abstract

As AI coding agents (e.g., GitHub Copilot, Devin, OpenAI Codex, Cursor) submit pull requests to open-source repositories at scale, a key question arises: do human reviewers gradually lower their scrutiny for AI-generated code over time? We conduct a longitudinal within-reviewer analysis using the AIDev dataset, studying 400 repeat reviewers who collectively submitted 11,429 reviews over a seven-month observation period. Comparing each reviewer's early and late review episodes, we observe a population-level shift in approval rate from 30.1% to 36.8% (Wilcoxon signed-rank p < 10^{-6} on paired shifts). Pooled by within-reviewer experience decile, the cumulative gap reaches +14.5 pp from first to tenth decile. This shift is experience-driven (persists after controlling for calendar time), agent-specific (human PR approval rates decline over the same period), and not explained by PR difficulty (median PR size is flat). However, review latency increases rather than decreases (+3.5x), while inline comment volume decreases (-22%, p=0.0014), suggesting reviewers spend more time in queue but less time actively inspecting code. The combination of rising approval, declining comment effort, and increasing queue time is most consistent with reflexive habituation under growing workload rather than rational trust calibration alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The within-reviewer rise in AI PR approvals is the clearest signal, but the habituation interpretation still needs tighter controls and full methods to hold up.

The paper's main finding is that the same 400 reviewers approve AI-generated pull requests more often as they review more of them, moving from 30% early to 37% later on average, with a 14.5 pp gap by the tenth experience decile. Human PR approvals move in the opposite direction over the same window, and median PR size stays flat. Reviewers also leave fewer inline comments while queue times lengthen. That pattern is what the authors tie to habituation under workload rather than growing rational trust.

The within-reviewer longitudinal design on this scale is the real step forward. Most earlier work on AI code review has been cross-sectional or relied on between-reviewer comparisons, so tracking the same people across 11k reviews over seven months gives a cleaner look at experience effects. The agent-specific contrast and the flat size metric are useful checks against obvious alternatives.

The soft spot is exactly where the stress-test note flags it. The abstract says the approval shift survives calendar-time controls, but it gives no regression table, fixed-effect structure, or list of additional covariates. If time enters only as coarse dummies and size is the only difficulty proxy, later PRs could still differ in language, complexity, or submitter experience in ways that raise approval rates without any change in reviewer behavior. The drop in comments and rise in latency are consistent with habituation, yet they could also reflect queue pressure or shifting reviewer attention without the paper showing robustness checks that close those channels.

This is for software engineering researchers who track how AI tools alter open-source review norms and for practitioners worried about cumulative quality drift. The raw patterns are worth attention even if the causal story needs more work.

I would send it to referees. The design is promising and the question matters at scale, but the write-up needs the full statistical details before the habituation claim can be taken as settled.

Referee Report

2 major / 2 minor

Summary. The paper conducts a longitudinal within-reviewer analysis of 400 repeat reviewers and 11,429 reviews from the AIDev dataset over seven months. It reports a population-level rise in AI PR approval rates from 30.1% to 36.8% (Wilcoxon p < 10^{-6}), reaching a +14.5 pp cumulative gap across experience deciles. The shift is claimed to be experience-driven after calendar-time controls, agent-specific (human PR approvals fall), and independent of difficulty (median PR size flat). Review latency rises 3.5x while inline comments fall 22% (p=0.0014), interpreted as evidence of reflexive habituation under workload rather than rational trust calibration.

Significance. If the within-reviewer controls and robustness checks hold, the work supplies a large-scale observational benchmark on how human oversight of AI-generated code evolves in open-source settings. The paired early/late design, agent-specific contrast, and combination of approval, comment, and latency metrics are strengths that could inform software-engineering practice and AI deployment policies. The observational framing limits strong causal claims, but the pattern is falsifiable with the reported dataset.

major comments (2)

[Methods / statistical model] The abstract states that the +14.5 pp approval rise 'persists after controlling for calendar time' and is 'not explained by PR difficulty (median PR size is flat),' yet supplies no regression specification, fixed-effect structure (e.g., reviewer FE, time FE, or interaction terms), covariate list, or robustness table. Without these details it is impossible to verify whether the experience-driven claim survives plausible alternative specifications such as finer-grained time trends or additional difficulty proxies.
[Results, experience-decile analysis] The claim that the approval increase is independent of difficulty rests on median PR size being flat, but size is only one proxy; if later PRs differ systematically in language, complexity, or submitter experience (unmeasured in the reported controls), the within-reviewer design alone does not close this channel. A table showing coefficient stability across alternative difficulty measures would be required to support the central interpretation.

minor comments (2)

[Abstract] The phrase 'pooled by within-reviewer experience decile' should be clarified with the exact decile construction and whether deciles are reviewer-specific or global.
[Abstract] The Wilcoxon signed-rank test is reported on paired shifts; the exact pairing (reviewer-level early vs. late) and handling of ties or missing data should be stated explicitly.
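The first minor comment turns on a real ambiguity, which a small sketch makes concrete. Both constructions below are hypothetical readings of "experience decile"; the paper does not state which, if either, it uses. The function names and the toy review counts are invented for illustration.

```python
from bisect import bisect_left


def reviewer_specific_deciles(n_reviews):
    """Decile (1..10) of each review within one reviewer's own history."""
    return [10 * k // n_reviews + 1 for k in range(n_reviews)]


def global_deciles(review_counts):
    """Decile (1..10) of each review against the pooled experience scale.

    review_counts: dict mapping reviewer_id -> number of reviews.
    A review's experience is its 0-based position k in that reviewer's
    history; cut points come from the pooled distribution of k.
    """
    pooled = sorted(k for n in review_counts.values() for k in range(n))
    total = len(pooled)

    def decile_of(k):
        # Count of pooled reviews with strictly lower experience than k.
        below = bisect_left(pooled, k)
        return 10 * below // total + 1

    return {
        reviewer: [decile_of(k) for k in range(n)]
        for reviewer, n in review_counts.items()
    }


own_scale = reviewer_specific_deciles(5)
pooled_scale = global_deciles({"a": 2, "b": 18})
```

Under the reviewer-specific reading every reviewer spans the full scale, so a five-review history already reaches decile 9. Under the global reading a low-volume reviewer never leaves the bottom deciles, so the top deciles are populated only by heavy reviewers and the +14.5 pp gap could partly reflect who is in each decile.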

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major point below and will revise the paper accordingly where feasible.

Referee: [Methods / statistical model] The abstract states that the +14.5 pp approval rise 'persists after controlling for calendar time' and is 'not explained by PR difficulty (median PR size is flat),' yet supplies no regression specification, fixed-effect structure (e.g., reviewer FE, time FE, or interaction terms), covariate list, or robustness table. Without these details it is impossible to verify whether the experience-driven claim survives plausible alternative specifications such as finer-grained time trends or additional difficulty proxies.

Authors: The referee is correct that the main text does not present the explicit regression equation, fixed-effects structure, or robustness table. The reported analysis relies on a within-reviewer paired comparison (early vs. late reviews per reviewer) with calendar-time controls, but these details are only summarized. We will revise the Methods section to include the full specification (e.g., approval rate modeled with reviewer fixed effects, calendar time fixed effects or trends, and experience decile as the key predictor), list all covariates, and add an appendix table with coefficient stability under alternative time-trend specifications (linear, quadratic, and month fixed effects). revision: yes
Referee: [Results, experience-decile analysis] The claim that the approval increase is independent of difficulty rests on median PR size being flat, but size is only one proxy; if later PRs differ systematically in language, complexity, or submitter experience (unmeasured in the reported controls), the within-reviewer design alone does not close this channel. A table showing coefficient stability across alternative difficulty measures would be required to support the central interpretation.

Authors: We agree that median PR size is only one proxy and that unmeasured shifts in complexity, language, or submitter experience could remain. The within-reviewer design eliminates reviewer-specific time-invariant confounders and the agent-specific contrast (declining human-PR approvals over the same window) provides indirect support against a general difficulty increase. However, the AIDev dataset does not contain additional difficulty proxies such as cyclomatic complexity, language-specific metrics, or submitter experience for the full sample. We will therefore expand the Limitations section to discuss this gap explicitly but cannot produce the requested coefficient-stability table across alternative measures. revision: partial
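For orientation, the specification promised in the first response above (reviewer fixed effects, calendar-time fixed effects, experience decile as the key predictor) could take a form such as the linear probability model below. This is a hypothetical sketch: the paper gives no equation, and every symbol here is an assumption introduced for illustration.

```latex
\[
\mathrm{approve}_{ip}
  = \alpha_i
  + \gamma_{m(p)}
  + \sum_{d=2}^{10} \beta_d \,\mathbf{1}\!\left[\mathrm{decile}_{ip} = d\right]
  + \delta \log\!\left(1 + \mathrm{size}_p\right)
  + \varepsilon_{ip}
\]
```

Here $i$ indexes reviewers and $p$ indexes AI-agent pull requests; $\alpha_i$ is a reviewer fixed effect, $\gamma_{m(p)}$ a fixed effect for the calendar month in which $p$ was reviewed, $\beta_d$ the approval difference of experience decile $d$ relative to the first decile, and $\mathrm{size}_p$ a PR-size measure. Under this reading, the referee's request amounts to showing that the $\beta_d$ estimates stay stable when $\gamma_{m(p)}$ is swapped for linear or quadratic time trends.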

standing simulated objections not resolved

Request for a table of coefficient stability across alternative difficulty measures (language, complexity, submitter experience), as these variables are unavailable in the AIDev dataset.

Circularity Check

0 steps flagged

No circularity: purely observational analysis of external dataset

The paper conducts a longitudinal within-reviewer statistical comparison on the AIDev dataset using paired tests (Wilcoxon signed-rank) and controls for calendar time, agent type, and PR size. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or derivations appear in the provided text. All reported shifts (approval rates, comment volume, latency) are direct empirical observations, making the analysis self-contained against external data without reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard statistical assumptions for paired non-parametric tests and the premise that the AIDev dataset accurately captures reviewer experience and PR metadata; no free parameters or invented entities are introduced.

axioms (1)

standard math Wilcoxon signed-rank test is appropriate for paired within-reviewer approval rate shifts
Used to obtain p < 10^{-6}
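As a reminder of what this test does with paired shifts, here is a minimal pure-Python sketch of the Wilcoxon signed-rank statistic using the normal approximation with a tie correction. It is an illustration under stated assumptions, not the authors' code: zero shifts are dropped, no continuity correction is applied, and the toy shifts are invented. In practice a library routine such as `scipy.stats.wilcoxon` would be the usual choice.

```python
import math


def wilcoxon_signed_rank(shifts):
    """Return (W+, z, two-sided p) for paired differences.

    Assumptions: zero differences are discarded, tied absolute values get
    average ranks, and the p-value uses the normal approximation.
    """
    diffs = [d for d in shifts if d != 0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all differences are zero")

    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        group_size = end - start + 1
        tie_term += group_size ** 3 - group_size
        start = end + 1

    # Sum of ranks attached to positive differences.
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, z, p_value


# Toy per-reviewer shifts in approval rate (late minus early), invented.
toy_shifts = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, -0.05, 0.0]
w_plus, z, p_value = wilcoxon_signed_rank(toy_shifts)
```

The test uses only the signs and ranks of the per-reviewer shifts, so it makes no normality assumption, but it says nothing about why the shifts are positive.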
