Recognition: unknown
Do AI Coding Agents Log Like Humans? An Empirical Study
Pith reviewed 2026-05-10 16:49 UTC · model grok-4.3
The pith
AI coding agents modify logging statements less frequently than humans and often disregard explicit logging instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 4,550 agentic pull requests in 81 repositories, agents change logging less often than humans in 58.4 percent of repositories, though they produce higher log density when they do change it. Explicit logging instructions appear in only 4.7 percent of cases, and agents fail to follow constructive requests 67 percent of the time. Humans perform 72.5 percent of post-generation log repairs, functioning as silent janitors who fix observability problems without explicit review feedback. These patterns point to a dual shortfall in natural-language guidance: scarcity of instructions and low agent compliance.
What carries the argument
The side-by-side comparison of logging modifications in agentic pull requests versus human baselines, combined with detection of explicit natural-language instructions and measurement of compliance and repair rates.
If this is right
- Natural language instructions alone do not reliably govern non-functional requirements such as logging.
- Agents produce code that may require subsequent human fixes to maintain observability.
- Consistent logging practices could depend on mechanisms other than prompts.
- Human review continues to play a central role in correcting AI-generated logging issues.
Where Pith is reading between the lines
- AI coding tools might benefit from built-in defaults or templates for logging instead of relying solely on user prompts.
- Similar patterns could appear in other non-functional areas such as error handling or performance instrumentation.
- Controlled experiments could test whether adding deterministic checks improves logging outcomes beyond what natural language achieves.
Load-bearing premise
The 4,550 pull requests can be accurately labeled as AI-generated and logging changes can be detected and classified without systematic bias from repository selection or detection methods.
What would settle it
A follow-up analysis that uses an independent method to identify AI-generated code and finds agents changing logging more often than humans across a comparable set of repositories.
Original abstract
Software logging is essential for maintaining and debugging complex systems, yet it remains unclear how AI coding agents handle this non-functional requirement. While prior work characterizes human logging practices, the behaviors of AI coding agents and the efficacy of natural language instructions in governing them are unexplored. To address this gap, we conduct an empirical study of 4,550 agentic pull requests across 81 open-source repositories. We compare agent logging patterns against human baselines and analyze the impact of explicit logging instructions. We find that agents change logging less often than humans in 58.4% of repositories, though they exhibit higher log density when they do. Furthermore, explicit logging instructions are rare (4.7%) and ineffective, as agents fail to comply with constructive requests 67% of the time. Finally, we observe that humans perform 72.5% of post-generation log repairs, acting as "silent janitors" who fix logging and observability issues without explicit review feedback. These findings indicate a dual failure in natural language instruction (i.e., scarcity of logging instructions and low agent compliance), suggesting that deterministic guardrails might be necessary to ensure consistent logging practices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical study of logging behaviors in 4,550 agentic pull requests across 81 open-source repositories. It compares AI coding agents to human baselines, finding that agents alter logging less frequently than humans in 58.4% of repositories (but with higher log density when they do), that explicit logging instructions appear in only 4.7% of cases and that constructive requests are complied with only 33% of the time, and that humans perform 72.5% of subsequent log repairs. The authors interpret these results as evidence of a dual failure of natural-language instructions and recommend deterministic guardrails instead.
Significance. If the measurements hold after validation, the work provides concrete data on how current AI agents handle a key non-functional requirement (logging) that prior human-focused studies have characterized. The scale (81 repos, thousands of PRs) and the observation that humans act as post-hoc “silent janitors” are useful for tool builders and for organizations adopting agentic workflows. The practical suggestion for guardrails is directly actionable.
major comments (4)
- [§3] §3 (or Methods section on data collection): The identification of the 4,550 agentic PRs is described as relying on repository selection plus detection rules (likely commit/PR metadata or tool signatures), yet no precision, recall, inter-rater agreement, or sensitivity analysis is reported. Because every comparative statistic (58.4%, 67%, 72.5%) depends on correct partitioning of PRs into agentic vs. human, the absence of validation metrics makes the central claims unverifiable from the given evidence.
- [§3.2] §3.2 (logging-change detection): The rules used to extract and classify logging edits from diffs (keyword, AST, or other patterns) are not accompanied by accuracy figures or manual audit results. Systematic under-detection of conditional, refactored, or framework-specific logging statements would directly bias the reported frequencies and the inference of non-compliance.
- [§4] §4 (results on instruction compliance): The claim that agents “fail to comply with constructive requests 67% of the time” rests on an operational definition of “explicit logging instructions” and “constructive requests” that is not detailed; without inter-annotator reliability or a clear coding scheme, the 67% figure cannot be assessed for reproducibility or selection bias.
- [§4, §5] §4 and §5: No statistical tests, confidence intervals, or controls for confounders (repository language, size, agent tool, PR age) are mentioned for the key percentages. The 58.4% “less often” claim and the 72.5% human-repair claim therefore lack evidence that the observed differences are not artifacts of the sampled repositories or detection heuristics.
minor comments (2)
- [Introduction] The abstract and introduction cite prior human logging studies but do not explicitly contrast the current detection heuristics with those used in the cited works; a short methods-comparison paragraph would improve context.
- [Results] Table or figure captions that report the 58.4%, 67%, and 72.5% figures should also state the exact denominators and any filtering steps applied.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. The comments help us improve the transparency and rigor of our empirical analysis. We respond to each major comment below and outline the revisions we will incorporate.
Point-by-point responses
-
Referee: [§3] §3 (or Methods section on data collection): The identification of the 4,550 agentic PRs is described as relying on repository selection plus detection rules (likely commit/PR metadata or tool signatures), yet no precision, recall, inter-rater agreement, or sensitivity analysis is reported. Because every comparative statistic (58.4%, 67%, 72.5%) depends on correct partitioning of PRs into agentic vs. human, the absence of validation metrics makes the central claims unverifiable from the given evidence.
Authors: We agree that validation metrics are important for establishing the reliability of our agentic PR identification. The method in §3 used a combination of automated detection based on known AI coding agent signatures in commit messages, PR titles, and author metadata, cross-referenced with repository selection criteria. While we did not report precision and recall in the initial submission, we will add a sensitivity analysis and manual validation results on a sample of PRs in the revised manuscript to quantify the accuracy of this partitioning. revision: yes
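For readers outside the repository-mining space, a minimal sketch of the kind of signature-based partitioning this response describes is shown below; the agent names and regular expressions are hypothetical placeholders, not the study's actual detection rules.

```python
import re

# Hypothetical signatures for illustration only; the paper's real agent list
# and matching rules are not reproduced in this review.
AGENT_SIGNATURES = [
    r"co-authored-by:.*(copilot|devin|sweep|openhands)",
    r"\[bot\]$",
    r"generated (by|with) (an? )?(ai|coding agent)",
]

def looks_agentic(pr_title: str, commit_messages: list[str], author_login: str) -> bool:
    """Return True if any PR metadata field matches a known agent signature."""
    haystacks = [pr_title.lower(), author_login.lower()] + [m.lower() for m in commit_messages]
    return any(re.search(pattern, text) for pattern in AGENT_SIGNATURES for text in haystacks)

# Example: a commit trailer naming an agent flags the PR as agentic.
print(looks_agentic(
    "Add retry logic",
    ["Fix flaky test\n\nCo-authored-by: Copilot <copilot@github.com>"],
    "octocat",
))  # True
```

A manual audit of a random sample of PRs flagged this way is what would yield the precision and recall figures the referee asks for.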
-
Referee: [§3.2] §3.2 (logging-change detection): The rules used to extract and classify logging edits from diffs (keyword, AST, or other patterns) are not accompanied by accuracy figures or manual audit results. Systematic under-detection of conditional, refactored, or framework-specific logging statements would directly bias the reported frequencies and the inference of non-compliance.
Authors: We recognize that the logging edit detection rules require validation to rule out systematic biases. Our approach combined keyword searches for common logging APIs with AST-based diff analysis to identify changes. In the revision, we will include results from a manual audit of a random sample of diffs, reporting accuracy metrics and discussing any limitations related to framework-specific or conditional logging statements. revision: yes
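As a rough, single-language illustration of AST-based detection of logging edits, the sketch below counts logger calls in Python source for the base and head versions of a file; the method-name list and the Python-only scope are assumptions made here, not the paper's multi-language tooling.

```python
import ast

# Hypothetical set of logging method names, for illustration only.
LOG_METHODS = {"debug", "info", "warning", "error", "critical", "exception", "log"}

def count_logging_calls(source: str) -> int:
    """Count calls such as logger.info(...) or logging.warning(...) in Python source."""
    tree = ast.parse(source)
    count = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr in LOG_METHODS:
                count += 1
    return count

before = "def f(x):\n    return x + 1\n"
after = "import logging\n\ndef f(x):\n    logging.info('f called with %s', x)\n    return x + 1\n"

# A PR "changes logging" when the count differs between base and head versions.
print(count_logging_calls(after) - count_logging_calls(before))  # 1
```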
-
Referee: [§4] §4 (results on instruction compliance): The claim that agents “fail to comply with constructive requests 67% of the time” rests on an operational definition of “explicit logging instructions” and “constructive requests” that is not detailed; without inter-annotator reliability or a clear coding scheme, the 67% figure cannot be assessed for reproducibility or selection bias.
Authors: The 67% figure is derived from a manual classification of PR comments and descriptions for explicit requests related to logging. We will provide a detailed coding scheme with examples in the revised §4, including the criteria for 'constructive' vs. vague requests. If the original annotation was performed by a single researcher, we will note this limitation and consider adding a second annotator for a subset to compute agreement in the revision. revision: partial
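If a second annotator is added for a subset, agreement on the coding of requests (e.g., 'constructive' vs. 'vague') could be summarized with Cohen's kappa; the labels below are invented for illustration and are not drawn from the paper's data.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Hypothetical double-annotated sample of five PR comments.
annotator_1 = ["constructive", "vague", "constructive", "constructive", "vague"]
annotator_2 = ["constructive", "vague", "vague", "constructive", "vague"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # ~0.615
```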
-
Referee: [§4, §5] §4 and §5: No statistical tests, confidence intervals, or controls for confounders (repository language, size, agent tool, PR age) are mentioned for the key percentages. The 58.4% “less often” claim and the 72.5% human-repair claim therefore lack evidence that the observed differences are not artifacts of the sampled repositories or detection heuristics.
Authors: We agree that adding statistical rigor would strengthen the presentation of results. In the revised manuscript, we will include appropriate statistical tests (e.g., proportion tests or chi-squared tests) with confidence intervals for the key percentages. We will also analyze and report on potential confounders such as programming language, repository size, and agent tool type, providing controlled comparisons where feasible. revision: yes
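A minimal sketch of the kind of test and interval the response proposes, run on made-up counts with SciPy (the paper's actual denominators are not reported in this review):

```python
import math
from scipy.stats import chi2_contingency

# Invented counts for illustration: PRs that changed logging vs. did not, by author type.
table = [[120, 880],   # agentic PRs
         [260, 740]]   # human PRs

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4g}")

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a single proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

print("agentic logging-change rate, 95% CI:", wilson_ci(120, 1000))
```

Reporting intervals like these alongside the 58.4%, 67%, and 72.5% figures would address the referee's concern directly.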
Circularity Check
No circularity: purely observational empirical measurements
Full rationale
This is an empirical study that directly measures logging changes in 4,550 identified agentic PRs across 81 repositories, compares frequencies and densities to human baselines, and counts instruction compliance and repair rates. No equations, fitted parameters, derivations, predictions, or ansatzes appear; all reported percentages (58.4%, 4.7%, 67%, 72.5%) are raw observational counts. No self-citation load-bearing steps or uniqueness theorems are invoked. The derivation chain is simply data extraction followed by tabulation, with no reduction to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption: AI coding agents can be reliably identified from pull request metadata or commit patterns in the selected repositories
- domain assumption: Changes to logging statements can be accurately and consistently detected across both agent and human contributions
- domain assumption: Explicit logging instructions within PR descriptions or comments can be identified and their compliance objectively measured