A Large-Scale Empirical Study of AI-Generated Code in Real-World Repositories
Pith reviewed 2026-05-14 22:45 UTC · model grok-4.3
The pith
AI-generated code in real-world repositories differs from human-written code in complexity, structure, and post-commit evolution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing a large dataset with a detection pipeline that combines heuristic filtering and LLM-based classification over real repositories, the study establishes that AI-generated code exhibits measurably distinct characteristics relative to conventional human-driven development: differences in complexity and structural properties at the code level, and in size, activity patterns, and evolutionary trajectories at the commit level.
What carries the argument
The detection pipeline that combines heuristic filtering with LLM-based classification to identify AI-generated code and enable large-scale comparative analysis against human-written code.
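The two-stage idea can be sketched as follows. This is an illustrative reconstruction, not the paper's actual implementation: the marker strings, function names, and the stubbed-out LLM call are all hypothetical.

```python
# Hypothetical sketch of a heuristic-plus-LLM detection pipeline:
# stage 1 cheaply prunes commits with surface heuristics, stage 2
# applies a (stubbed) LLM classifier to the survivors.

AI_MARKERS = ("generated by", "copilot", "chatgpt", "as an ai")

def heuristic_filter(commit_message: str, diff: str) -> bool:
    """Stage 1: keep only commits that plausibly involve AI-generated code."""
    text = (commit_message + " " + diff).lower()
    return any(marker in text for marker in AI_MARKERS)

def llm_classify(commit: dict) -> str:
    """Stage 2 placeholder: a real pipeline would prompt an LLM with the
    diff and parse its verdict; here we stub it out deterministically."""
    text = (commit["msg"] + " " + commit["diff"]).lower()
    return "ai" if "copilot" in text else "human"

def detect(commits: list) -> list:
    """Run both stages and return commits labeled as AI-generated."""
    candidates = [c for c in commits if heuristic_filter(c["msg"], c["diff"])]
    return [c for c in candidates if llm_classify(c) == "ai"]

commits = [
    {"msg": "Add parser (generated by Copilot)", "diff": "+ def parse(): ..."},
    {"msg": "Fix typo", "diff": "- teh\n+ the"},
]
print(len(detect(commits)))  # → 1
```

The cheap first stage is what makes repository-scale analysis tractable: only the small fraction of commits that survive the heuristics incur an LLM call.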
If this is right
- AI-assisted code displays different complexity and structural characteristics than human-written code.
- Commits involving AI-generated code show distinct size and activity patterns.
- Post-commit evolution of AI code follows different trajectories than human code.
- Overall development practices shift measurably when AI assistance is present at scale.
Where Pith is reading between the lines
- The observed patterns could be used to calibrate future AI coding models so they better align with human-like structures and maintenance needs.
- Repository maintainers and code reviewers may require new processes tailored to the distinct defect and evolution profiles of AI-generated contributions.
- Longitudinal tracking of the same repositories could reveal whether the differences grow or shrink as AI tools improve over time.
Load-bearing premise
The heuristic filtering combined with LLM classification accurately identifies AI-generated code at scale with error rates low enough to support valid comparisons of characteristics.
What would settle it
A manual review of a statistically meaningful random sample from the classified set that reveals a high rate of false positives, or a replication using an independent detection method that eliminates the reported differences, would falsify the central comparisons.
Original abstract
Large language models (LLMs) are increasingly used in software development, generating code that ranges from short snippets to substantial project components. As AI-generated code becomes more common in real-world repositories, it is important to understand how it differs from human-written code and how AI assistance may influence development practices. However, existing studies have largely relied on small-scale or controlled settings, leaving a limited understanding of AI-generated code in the wild. In this work, we present a large-scale empirical study of AI-generated code collected from real-world repositories. We examine both code-level properties, including complexity, structural characteristics, and defect-related indicators, and commit-level characteristics, such as commit size, activity patterns, and post-commit evolution. To support this study, we develop a detection pipeline that combines heuristic filtering with LLM-based classification to identify AI-generated code and construct a large-scale dataset for analysis. Our study provides a comprehensive view of the characteristics of AI-generated code in practice and highlights how AI-assisted development differs from conventional human-driven development. These findings contribute to a better understanding of the real-world impact of AI-assisted programming and offer an empirical basis for future research on AI-generated software.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a large-scale empirical study of AI-generated code in real-world repositories. It develops a detection pipeline that combines heuristic filtering with LLM-based classification to identify AI-generated code, constructs a corresponding dataset, and compares code-level properties (complexity, structural characteristics, defect indicators) and commit-level properties (size, activity patterns, post-commit evolution) against human-written code to highlight differences from conventional development.
Significance. If the detection pipeline proves reliable, the work would offer a valuable large-scale, observational view of AI-assisted coding in production repositories, extending beyond the small-scale or controlled settings of prior studies and supplying an empirical foundation for understanding AI's impact on software development practices.
major comments (2)
- [Methods / Detection Pipeline] The detection pipeline (described in the abstract and presumably detailed in the Methods section) is presented as combining heuristic filtering with LLM-based classification, yet no precision, recall, inter-annotator agreement, or error analysis on real commits is supplied. Because every downstream comparison of complexity, defects, commit size, and evolution rests on the fidelity of this labeling, the absence of validation metrics leaves the central observational claims unsupported.
- [Results / Dataset Construction] No dataset size, sampling strategy, or statistical details (error bars, confidence intervals, or hypothesis tests) appear in the abstract or summary. Without these, it is impossible to evaluate whether reported differences in code and commit characteristics are robust or could be artifacts of detection errors or selection bias.
minor comments (1)
- [Abstract] The abstract would be strengthened by briefly stating the scale of the constructed dataset and one or two headline quantitative findings.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important gaps in validation and statistical reporting that we will address in the revision to strengthen the reliability of our claims.
Point-by-point responses
-
Referee: [Methods / Detection Pipeline] The detection pipeline (described in the abstract and presumably detailed in the Methods section) is presented as combining heuristic filtering with LLM-based classification, yet no precision, recall, inter-annotator agreement, or error analysis on real commits is supplied. Because every downstream comparison of complexity, defects, commit size, and evolution rests on the fidelity of this labeling, the absence of validation metrics leaves the central observational claims unsupported.
Authors: We agree that explicit validation metrics are necessary to support the labeling fidelity and all downstream comparisons. The Methods section describes the pipeline components, but we did not include quantitative validation on real commits in the initial submission. In the revised version, we will add a dedicated validation subsection reporting precision, recall, and F1 on a manually annotated sample of 1,000 real commits (with inter-annotator agreement via Cohen's kappa), plus a detailed error analysis categorizing false positives and negatives. This will be accompanied by a new table of metrics. revision: yes
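The metrics promised in this response can be computed directly from a hand-labeled sample. The labels below are invented for illustration; only the formulas (precision, recall, F1, and Cohen's kappa for inter-annotator agreement) are the point.

```python
# Illustrative validation metrics for a detector, computed on a
# made-up hand-labeled sample of commits.

def precision_recall_f1(gold, pred, positive="ai"):
    """Standard precision/recall/F1 for the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def cohens_kappa(a, b):
    """Agreement between two label sequences, corrected for chance."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

gold = ["ai", "ai", "human", "human", "ai", "human"]   # manual annotation
pred = ["ai", "human", "human", "human", "ai", "ai"]   # pipeline output
p, r, f = precision_recall_f1(gold, pred)
kappa = cohens_kappa(gold, pred)
```

On this toy sample, precision, recall, and F1 all come to 2/3; a real validation would report these over a sample of the size the authors propose (e.g. 1,000 commits), with kappa computed between the human annotators.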
-
Referee: [Results / Dataset Construction] No dataset size, sampling strategy, or statistical details (error bars, confidence intervals, or hypothesis tests) appear in the abstract or summary. Without these, it is impossible to evaluate whether reported differences in code and commit characteristics are robust or could be artifacts of detection errors or selection bias.
Authors: We acknowledge the need for these details to assess robustness. While the full manuscript (Section 4) describes the overall scale of the dataset and repository sampling, we will revise the Results section to explicitly report exact dataset sizes (repositories, commits, and AI-generated instances), the sampling strategy (random stratified sampling by language and repository size), and statistical details including 95% confidence intervals, error bars on figures, and hypothesis test results (e.g., Mann-Whitney U tests with p-values) for all reported differences. This will mitigate concerns about selection bias or detection artifacts. revision: yes
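The hypothesis test named in this response can be sketched in a few lines. The metric values below are invented, and the p-value uses the normal approximation (ties are handled only crudely); this is a sketch of the Mann-Whitney U procedure, not the paper's analysis.

```python
# Minimal Mann-Whitney U test (two-sided, normal approximation) comparing
# an invented code metric between AI-labeled and human-labeled commits.
import math

def mann_whitney_u(x, y):
    """Return the U statistic and a two-sided p-value via the
    normal approximation (adequate for moderate sample sizes)."""
    # U counts, over all pairs, how often x beats y (ties count 0.5).
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    n1, n2 = len(x), len(y)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u, p

ai_complexity = [7, 9, 6, 8, 10]      # hypothetical cyclomatic complexities
human_complexity = [4, 5, 3, 6, 5]
u_stat, p_value = mann_whitney_u(ai_complexity, human_complexity)
```

For samples this small an exact test would be preferable; at the repository scale the study works with, the normal approximation (or a library routine such as `scipy.stats.mannwhitneyu`) is the standard choice.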
Circularity Check
No circularity: purely observational empirical study with no derivations or self-referential reductions
Full rationale
This paper is an empirical observational study that collects and measures code properties and commit characteristics directly from external real-world repositories. No derivation chain, equations, fitted parameters presented as predictions, or first-principles results exist. The detection pipeline is a methodological tool for dataset construction, not a self-defining or fitted input that is then renamed as a prediction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. All claims reduce to direct measurements from the constructed dataset rather than to the paper's own inputs by construction. Limitations around pipeline validation affect data reliability but do not constitute circularity in any derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM-based classifiers, combined with heuristics, can produce reliable labels for AI-generated code at repository scale.
Reference graph
Works this paper leans on
- [1] Maurício Aniche. 2026. CK. https://github.com/mauricioaniche/ck Accessed: 2026-03-26.
- [2] Tamas Bisztray, Bilel Cherif, Richard A. Dubniczky, Nils Gruschka, Bertalan Borsos, Mohamed Amine Ferrag, Attila Kovacs, Vasileios Mavroeidis, and Norbert Tihanyi. 2026. I Know Which LLM Wrote Your Code Last Summer: LLM-Generated Code Stylometry for Authorship Attribution. In Proceedings of the 18th ACM Workshop on Artificial Intelligence and Security (A...
- [3] Gavin S. Black, Bhaskar P. Rimal, and Varghese Mathew Vaidyan. 2025. Balancing Security and Correctness in Code Generation: An Empirical Study on Commercial Large Language Models. IEEE Transactions on Emerging Topics in Computational Intelligence 9, 1 (2025), 419–430. doi:10.1109/TETCI.2024.3446695
- [4] Hongbo Chen, Yifan Zhang, Xing Han, Tianhao Mao, Huanyao Rong, Yuheng Zhang, XiaoFeng Wang, Luyi Xing, Xun Chen, and Hang Zhang. 2025. LineBreaker: Finding Token-Inconsistency Bugs with Large Language Models. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 893–905. doi:10.1109/ASE63991.2025.00079
- [5] Pygments contributors. 2026. Pygments. https://pygments.org/ Accessed: 2026-03-26.
- [6] Albert Danial. 2026. cloc: v2.08. doi:10.5281/zenodo.5760077
- [7] Simone Daniotti, Johannes Wachs, Xiangnan Feng, and Frank Neffke. 2026. Who is using AI to code? Global diffusion and impact of generative AI. Science 391, 6787 (2026), 831–835. doi:10.1126/science.adz9311
- [8] Yujia Fu, Peng Liang, Amjed Tahir, Zengyang Li, Mojtaba Shahin, Jiaxin Yu, and Jinfu Chen. 2025. Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study. ACM Trans. Softw. Eng. Methodol. 34, 8, Article 218 (Oct. 2025), 34 pages. doi:10.1145/3716848
- [9]
- [10] Jiaxun Guo, Ziyuan Yang, Mengyu Sun, Hui Wang, Jingfeng Lu, and Yi Zhang
- [11] Code Fingerprints: Disentangled Attribution of LLM-Generated Code. arXiv:2603.04212 [cs.SE] https://arxiv.org/abs/2603.04212
- [12] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. 2008. Exploring Network Structure, Dynamics, and Function using NetworkX. In Proceedings of the 7th Python in Science Conference. 11–15. doi:10.25080/TCWV9851
- [13] S M Mahedy Hasan, Md Fazle Rabbi, and Minhaz Zibran. 2026. The Quiet Contributions: Insights into AI-Generated Silent Pull Requests. arXiv:2601.21102 [cs.SE] https://arxiv.org/abs/2601.21102 Mining Challenge track of the 23rd International Conference on Mining Software Repositories (MSR 2026)
- [14] Jingxuan He and Martin Vechev. 2023. Large Language Models for Code: Security Hardening and Adversarial Testing. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (Copenhagen, Denmark) (CCS '23). Association for Computing Machinery, New York, NY, USA, 1865–1879. doi:10.1145/3576915.3623175
- [15] joern.io. 2026. Joern: The Bug Hunter's Workbench. https://github.com/joernio/joern
- [16] jscpd contributors. 2026. jscpd. https://github.com/kucherenko/jscpd Accessed: 2026-03-26.
- [17] Raphaël Khoury, Anderson R. Avila, Jacob Brunelle, and Baba Mamadou Camara
- [18] How Secure is Code Generated by ChatGPT?. In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC). 2445–2451. doi:10.1109/SMC53992.2023.10394237
- [19]
- [20] Shuang Li, Yuntao Cheng, Jinfu Chen, Jifeng Xuan, Sen He, and Weiyi Shang
- [21] Performance analysis of AI-generated code: A case study of Copilot, Copilot Chat, CodeLlaMa, and DeepSeek-Coder models. Empirical Softw. Engg. 31, 3 (Jan. 2026), 52 pages. doi:10.1007/s10664-025-10776-1
- [22] Jie Lin and David Mohaisen. 2025. From Large to Mammoth: A Comparative Evaluation of Large Language Models in Vulnerability Detection. In 32nd Annual Network and Distributed System Security Symposium, NDSS 2025, San Diego, California, USA, February 24-28, 2025. The Internet Society. https://www.ndss-symposium.org/ndss-paper/from-large-to-mammoth-a-com...
- [23] H. B. Mann and D. R. Whitney. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics 18, 1 (1947), 50–60. doi:10.1214/aoms/1177730491
- [24] Quinn McNemar. 1947. Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. Psychometrika 12, 2 (1947), 153–157. doi:10.1007/BF02295996
- [25] Ahmad Mohsin, Helge Janicke, Adrian Wood, Iqbal H. Sarker, Leandros Maglaras, and Naeem Janjua. 2024. Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs. arXiv:2406.12513 [cs.CR] https://arxiv.org/abs/2406.12513
- [26] Alfred Santa Molison, Marcia Moraes, Glaucia Melo, Fabio Santos, and Wesley K. G. Assunção. 2025. Is LLM-Generated Code More Maintainable & Reliable Than Human-Written Code?. In 2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 151–162. doi:10.1109/ESEM64174.2025.00036
- [27] Daniil Orel, Dilshod Azizov, and Preslav Nakov. 2025. CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings. In Findings of the Association for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics,...
- [28] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 754–768
- [29] Karl Pearson. 1900. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50, 302 (1900), 157–175. doi:10.1080/14786440009463897
- [30]
- [31] Romain Robbes, Théo Matricon, Thomas Degueule, Andre Hora, and Stefano Zacchiroli. 2026. Agentic Much? Adoption of Coding Agents on GitHub. arXiv:2601.18341 [cs.SE] https://arxiv.org/abs/2601.18341
- [32]
- [33] Andreas Schaad, Stefan Götz, and Dominik Binder. 2025. You Still Have to Study: On the Security of LLM Generated Code. In ICT Systems Security and Privacy Protection, Lili Nemec Zlatolas, Kai Rannenberg, Tatjana Welzer, and Joaquin Garcia-Alfaro (Eds.). Springer Nature Switzerland, Cham, 111–124
- [34] Maximilian Schreiber and Pascal Tippe. 2025. Security Vulnerabilities in AI-Generated Code: A Large-Scale Analysis of Public GitHub Repositories. Springer Nature Singapore, 153–172. doi:10.1007/978-981-95-3537-8_9
- [35] SciTools. 2026. Understand. https://scitools.com/ Accessed: 2026-03-26.
- [36] Mohammed Latif Siddiq, Joanna Cecilia da Silva Santos, Sajith Devareddy, and Anna Muller. 2024. SALLM: Security Assessment of Generated Code. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW '24). ACM, 54–65. doi:10.1145/3691621.3694934
- [37]
- [38] Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, and Toufique Ahmed. 2025. Calibration and Correctness of Language Models for Code. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (Ottawa, Ontario, Canada) (ICSE '25). IEEE Press, 540–552. doi:10....
- [39] Hyunjae Suh, Mahan Tafreshipour, Jiawei Li, Adithya Bhattiprolu, and Iftekhar Ahmed. 2025. An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We?. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (Ottawa, Ontario, Canada) (ICSE '25). IEEE Press, 859–871. doi:10.1109/ICSE55347.2025.00064
- [40] tree-sitter contributors. 2025. tree-sitter. https://github.com/tree-sitter/tree-sitter
- [41] Jiexin Wang, Xitong Luo, Liuwen Cao, Hongkui He, Hailin Huang, Jiayuan Xie, Adam Jatowt, and Yi Cai. 2024. Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval. arXiv:2407.02395 [cs.SE] https://arxiv.org/abs/2407.02395
- [42]
- [43] Tao Xiao, Youmei Fan, Fabio Calefato, Christoph Treude, Raula Gaikovina Kula, Hideaki Hata, and Sebastian Baltes. 2026. Self-Admitted GenAI Usage in Open-Source Software. arXiv:2507.10422 [cs.SE] https://arxiv.org/abs/2507.10422
- [44] Yuliang Xu, Siming Huang, Mingmeng Geng, Yao Wan, Xuanhua Shi, and Dongping Chen. 2026. code-transformed: The Influence of Large Language Models on Code. In Findings of the Association for Computational Linguistics: EACL 2026, Vera Demberg, Kentaro Inui, and Lluís Marquez (Eds.). Association for Computational Linguistics, Rabat, Morocco, 5462–5490. doi:1...
- [45] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pretrained Models. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 428–439. doi:10.1145/...
- [46] Beiqi Zhang, Peng Liang, Qiong Feng, Yujia Fu, and Zengyang Li. 2024. Copilot-in-the-Loop: Fixing Code Smells in Copilot-Generated Python Code using Copilot. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (Sacramento, CA, USA) (ASE '24). Association for Computing Machinery, New York, NY, USA, 2230–2234. doi:...