A Large-Scale Empirical Study of AI-Generated Code in Real-World Repositories
Pith reviewed 2026-05-14 22:45 UTC · model grok-4.3
The pith
AI-generated code in real-world repositories differs from human-written code in complexity, structure, and post-commit evolution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing a large dataset with a detection pipeline that combines heuristic filtering and LLM-based classification over real repositories, the study establishes that AI-generated code exhibits measurably distinct characteristics relative to conventional human-driven development: differences in complexity and structural properties at the code level, and in size, activity patterns, and evolutionary trajectories at the commit level.
What carries the argument
The detection pipeline that combines heuristic filtering with LLM-based classification to identify AI-generated code and enable large-scale comparative analysis against human-written code.
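The two-stage idea can be sketched as follows. This is an illustrative reconstruction, not the paper's actual implementation: the marker strings, function names, and the stubbed-out LLM call are all hypothetical.

```python
# Hypothetical sketch of a heuristic-plus-LLM detection pipeline:
# stage 1 cheaply prunes commits with surface heuristics, stage 2
# applies a (stubbed) LLM classifier to the survivors.

AI_MARKERS = ("generated by", "copilot", "chatgpt", "as an ai")

def heuristic_filter(commit_message: str, diff: str) -> bool:
    """Stage 1: keep only commits that plausibly involve AI-generated code."""
    text = (commit_message + " " + diff).lower()
    return any(marker in text for marker in AI_MARKERS)

def llm_classify(commit: dict) -> str:
    """Stage 2 placeholder: a real pipeline would prompt an LLM with the
    diff and parse its verdict; here we stub it out deterministically."""
    text = (commit["msg"] + " " + commit["diff"]).lower()
    return "ai" if "copilot" in text else "human"

def detect(commits: list) -> list:
    """Run both stages and return commits labeled as AI-generated."""
    candidates = [c for c in commits if heuristic_filter(c["msg"], c["diff"])]
    return [c for c in candidates if llm_classify(c) == "ai"]

commits = [
    {"msg": "Add parser (generated by Copilot)", "diff": "+ def parse(): ..."},
    {"msg": "Fix typo", "diff": "- teh\n+ the"},
]
print(len(detect(commits)))  # → 1
```

The cheap first stage is what makes repository-scale analysis tractable: only the small fraction of commits that survive the heuristics incur an LLM call.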
If this is right
- AI-assisted code displays different complexity and structural characteristics than human-written code.
- Commits involving AI-generated code show distinct size and activity patterns.
- Post-commit evolution of AI code follows different trajectories than human code.
- Overall development practices shift measurably when AI assistance is present at scale.
Where Pith is reading between the lines
- The observed patterns could be used to calibrate future AI coding models so they better align with human-like structures and maintenance needs.
- Repository maintainers and code reviewers may require new processes tailored to the distinct defect and evolution profiles of AI-generated contributions.
- Longitudinal tracking of the same repositories could reveal whether the differences grow or shrink as AI tools improve over time.
Load-bearing premise
The heuristic filtering combined with LLM classification accurately identifies AI-generated code at scale with error rates low enough to support valid comparisons of characteristics.
What would settle it
A manual review of a statistically meaningful random sample from the classified set that reveals a high rate of false positives, or a replication using an independent detection method that eliminates the reported differences, would falsify the central comparisons.
Original abstract
Large language models (LLMs) are increasingly used in software development, generating code that ranges from short snippets to substantial project components. As AI-generated code becomes more common in real-world repositories, it is important to understand how it differs from human-written code and how AI assistance may influence development practices. However, existing studies have largely relied on small-scale or controlled settings, leaving a limited understanding of AI-generated code in the wild. In this work, we present a large-scale empirical study of AI-generated code collected from real-world repositories. We examine both code-level properties, including complexity, structural characteristics, and defect-related indicators, and commit-level characteristics, such as commit size, activity patterns, and post-commit evolution. To support this study, we develop a detection pipeline that combines heuristic filtering with LLM-based classification to identify AI-generated code and construct a large-scale dataset for analysis. Our study provides a comprehensive view of the characteristics of AI-generated code in practice and highlights how AI-assisted development differs from conventional human-driven development. These findings contribute to a better understanding of the real-world impact of AI-assisted programming and offer an empirical basis for future research on AI-generated software.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a large-scale empirical study of AI-generated code in real-world repositories. It develops a detection pipeline that combines heuristic filtering with LLM-based classification to identify AI-generated code, constructs a corresponding dataset, and compares code-level properties (complexity, structural characteristics, defect indicators) and commit-level properties (size, activity patterns, post-commit evolution) against human-written code to highlight differences from conventional development.
Significance. If the detection pipeline proves reliable, the work would offer a valuable large-scale, observational view of AI-assisted coding in production repositories, extending beyond the small-scale or controlled settings of prior studies and supplying an empirical foundation for understanding AI's impact on software development practices.
major comments (2)
- [Methods / Detection Pipeline] The detection pipeline (described in the abstract and presumably detailed in the Methods section) is presented as combining heuristic filtering with LLM-based classification, yet no precision, recall, inter-annotator agreement, or error analysis on real commits is supplied. Because every downstream comparison of complexity, defects, commit size, and evolution rests on the fidelity of this labeling, the absence of validation metrics leaves the central observational claims unsupported.
- [Results / Dataset Construction] No dataset size, sampling strategy, or statistical details (error bars, confidence intervals, or hypothesis tests) appear in the abstract or summary. Without these, it is impossible to evaluate whether reported differences in code and commit characteristics are robust or could be artifacts of detection errors or selection bias.
minor comments (1)
- [Abstract] The abstract would be strengthened by briefly stating the scale of the constructed dataset and one or two headline quantitative findings.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important gaps in validation and statistical reporting that we will address in the revision to strengthen the reliability of our claims.
Point-by-point responses
-
Referee: [Methods / Detection Pipeline] The detection pipeline (described in the abstract and presumably detailed in the Methods section) is presented as combining heuristic filtering with LLM-based classification, yet no precision, recall, inter-annotator agreement, or error analysis on real commits is supplied. Because every downstream comparison of complexity, defects, commit size, and evolution rests on the fidelity of this labeling, the absence of validation metrics leaves the central observational claims unsupported.
Authors: We agree that explicit validation metrics are necessary to support the labeling fidelity and all downstream comparisons. The Methods section describes the pipeline components, but we did not include quantitative validation on real commits in the initial submission. In the revised version, we will add a dedicated validation subsection reporting precision, recall, and F1 on a manually annotated sample of 1,000 real commits (with inter-annotator agreement via Cohen's kappa), plus a detailed error analysis categorizing false positives and negatives. This will be accompanied by a new table of metrics. revision: yes
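The metrics promised in this response can be computed directly from a hand-labeled sample. The labels below are invented for illustration; only the formulas (precision, recall, F1, and Cohen's kappa for inter-annotator agreement) are the point.

```python
# Illustrative validation metrics for a detector, computed on a
# made-up hand-labeled sample of commits.

def precision_recall_f1(gold, pred, positive="ai"):
    """Standard precision/recall/F1 for the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def cohens_kappa(a, b):
    """Agreement between two label sequences, corrected for chance."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

gold = ["ai", "ai", "human", "human", "ai", "human"]   # manual annotation
pred = ["ai", "human", "human", "human", "ai", "ai"]   # pipeline output
p, r, f = precision_recall_f1(gold, pred)
kappa = cohens_kappa(gold, pred)
```

On this toy sample, precision, recall, and F1 all come to 2/3; a real validation would report these over a sample of the size the authors propose (e.g. 1,000 commits), with kappa computed between the human annotators.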
-
Referee: [Results / Dataset Construction] No dataset size, sampling strategy, or statistical details (error bars, confidence intervals, or hypothesis tests) appear in the abstract or summary. Without these, it is impossible to evaluate whether reported differences in code and commit characteristics are robust or could be artifacts of detection errors or selection bias.
Authors: We acknowledge the need for these details to assess robustness. While the full manuscript (Section 4) describes the overall scale of the dataset and repository sampling, we will revise the Results section to explicitly report exact dataset sizes (repositories, commits, and AI-generated instances), the sampling strategy (random stratified sampling by language and repository size), and statistical details including 95% confidence intervals, error bars on figures, and hypothesis test results (e.g., Mann-Whitney U tests with p-values) for all reported differences. This will mitigate concerns about selection bias or detection artifacts. revision: yes
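The hypothesis test named in this response can be sketched in a few lines. The metric values below are invented, and the p-value uses the normal approximation (ties are handled only crudely); this is a sketch of the Mann-Whitney U procedure, not the paper's analysis.

```python
# Minimal Mann-Whitney U test (two-sided, normal approximation) comparing
# an invented code metric between AI-labeled and human-labeled commits.
import math

def mann_whitney_u(x, y):
    """Return the U statistic and a two-sided p-value via the
    normal approximation (adequate for moderate sample sizes)."""
    # U counts, over all pairs, how often x beats y (ties count 0.5).
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    n1, n2 = len(x), len(y)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u, p

ai_complexity = [7, 9, 6, 8, 10]      # hypothetical cyclomatic complexities
human_complexity = [4, 5, 3, 6, 5]
u_stat, p_value = mann_whitney_u(ai_complexity, human_complexity)
```

For samples this small an exact test would be preferable; at the repository scale the study works with, the normal approximation (or a library routine such as `scipy.stats.mannwhitneyu`) is the standard choice.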
Circularity Check
No circularity: purely observational empirical study with no derivations or self-referential reductions
Full rationale
This paper is an empirical observational study that collects and measures code properties and commit characteristics directly from external real-world repositories. No derivation chain, equations, fitted parameters presented as predictions, or first-principles results exist. The detection pipeline is a methodological tool for dataset construction, not a self-defining or fitted input that is then renamed as a prediction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. All claims reduce to direct measurements from the constructed dataset rather than to the paper's own inputs by construction. Limitations around pipeline validation affect data reliability but do not constitute circularity in any derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM-based classifiers, combined with heuristics, can produce reliable labels for AI-generated code at repository scale.
Reference graph
Works this paper leans on
- [1] Maurício Aniche. 2026. CK. https://github.com/mauricioaniche/ck Accessed: 2026-03-26.
- [2] Tamas Bisztray, Bilel Cherif, Richard A. Dubniczky, Nils Gruschka, Bertalan Borsos, Mohamed Amine Ferrag, Attila Kovacs, Vasileios Mavroeidis, and Norbert Tihanyi. 2026. I Know Which LLM Wrote Your Code Last Summer: LLM-Generated Code Stylometry for Authorship Attribution. In Proceedings of the 18th ACM Workshop on Artificial Intelligence and Security (A...
- [3] Gavin S. Black, Bhaskar P. Rimal, and Varghese Mathew Vaidyan. 2025. Balancing Security and Correctness in Code Generation: An Empirical Study on Commercial Large Language Models. IEEE Transactions on Emerging Topics in Computational Intelligence 9, 1 (2025), 419–430. doi:10.1109/TETCI.2024.3446695
- [4] Hongbo Chen, Yifan Zhang, Xing Han, Tianhao Mao, Huanyao Rong, Yuheng Zhang, XiaoFeng Wang, Luyi Xing, Xun Chen, and Hang Zhang. 2025. LineBreaker: Finding Token-Inconsistency Bugs with Large Language Models. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 893–905. doi:10.1109/ASE63991.2025.00079
- [5] Pygments contributors. 2026. Pygments. https://pygments.org/ Accessed: 2026-03-26.
- [6] Albert Danial. 2026. cloc: v2.08. doi:10.5281/zenodo.5760077
- [7] Simone Daniotti, Johannes Wachs, Xiangnan Feng, and Frank Neffke. 2026. Who is using AI to code? Global diffusion and impact of generative AI. Science 391, 6787 (2026), 831–835. doi:10.1126/science.adz9311
- [8] Yujia Fu, Peng Liang, Amjed Tahir, Zengyang Li, Mojtaba Shahin, Jiaxin Yu, and Jinfu Chen. 2025. Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study. ACM Trans. Softw. Eng. Methodol. 34, 8, Article 218 (Oct. 2025), 34 pages. doi:10.1145/3716848
- [9]
- [10] Jiaxun Guo, Ziyuan Yang, Mengyu Sun, Hui Wang, Jingfeng Lu, and Yi Zhang
- [11] Code Fingerprints: Disentangled Attribution of LLM-Generated Code. arXiv:2603.04212 [cs.SE] https://arxiv.org/abs/2603.04212
- [12] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. 2008. Exploring Network Structure, Dynamics, and Function using NetworkX. In Proceedings of the 7th Python in Science Conference. 11–15. doi:10.25080/TCWV9851
- [13] S M Mahedy Hasan, Md Fazle Rabbi, and Minhaz Zibran. 2026. The Quiet Contributions: Insights into AI-Generated Silent Pull Requests. arXiv:2601.21102 [cs.SE] https://arxiv.org/abs/2601.21102 Mining Challenge track of the 23rd International Conference on Mining Software Repositories (MSR 2026)
- [14] Jingxuan He and Martin Vechev. 2023. Large Language Models for Code: Security Hardening and Adversarial Testing. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (Copenhagen, Denmark) (CCS '23). Association for Computing Machinery, New York, NY, USA, 1865–1879. doi:10.1145/3576915.3623175
- [15] joern.io. 2026. Joern: The Bug Hunter's Workbench. https://github.com/joernio/joern
- [16] jscpd contributors. 2026. jscpd. https://github.com/kucherenko/jscpd Accessed: 2026-03-26.
- [17] Raphaël Khoury, Anderson R. Avila, Jacob Brunelle, and Baba Mamadou Camara
- [18] How Secure is Code Generated by ChatGPT?. In 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC). 2445–2451. doi:10.1109/SMC53992.2023.10394237
- [19]
- [20] Shuang Li, Yuntao Cheng, Jinfu Chen, Jifeng Xuan, Sen He, and Weiyi Shang
- [21] Performance analysis of AI-generated code: A case study of Copilot, Copilot Chat, CodeLlaMa, and DeepSeek-Coder models. Empirical Softw. Engg. 31, 3 (Jan. 2026), 52 pages. doi:10.1007/s10664-025-10776-1
- [22] Jie Lin and David Mohaisen. 2025. From Large to Mammoth: A Comparative Evaluation of Large Language Models in Vulnerability Detection. In 32nd Annual Network and Distributed System Security Symposium, NDSS 2025, San Diego, California, USA, February 24-28, 2025. The Internet Society. https://www.ndss-symposium.org/ndss-paper/from-large-to-mammoth-a-com...
- [23] H. B. Mann and D. R. Whitney. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics 18, 1 (1947), 50–60. doi:10.1214/aoms/1177730491
- [24] Quinn McNemar. 1947. Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. Psychometrika 12, 2 (1947), 153–157. doi:10.1007/BF02295996
- [25] Ahmad Mohsin, Helge Janicke, Adrian Wood, Iqbal H. Sarker, Leandros Maglaras, and Naeem Janjua. 2024. Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs. arXiv:2406.12513 [cs.CR] https://arxiv.org/abs/2406.12513
- [26] Alfred Santa Molison, Marcia Moraes, Glaucia Melo, Fabio Santos, and Wesley K. G. Assunção. 2025. Is LLM-Generated Code More Maintainable & Reliable Than Human-Written Code?. In 2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 151–162. doi:10.1109/ESEM64174.2025.00036
- [27] Daniil Orel, Dilshod Azizov, and Preslav Nakov. 2025. CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings. In Findings of the Association for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics,...
- [28] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. In 2022 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 754–768
- [29] Karl Pearson. 1900. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50, 302 (1900), 157–175. doi:10.1080/14786440009463897
- [30]
- [31] Romain Robbes, Théo Matricon, Thomas Degueule, Andre Hora, and Stefano Zacchiroli. 2026. Agentic Much? Adoption of Coding Agents on GitHub. arXiv:2601.18341 [cs.SE] https://arxiv.org/abs/2601.18341
- [32]
- [33] Andreas Schaad, Stefan Götz, and Dominik Binder. 2025. You Still Have to Study: On the Security of LLM Generated Code. In ICT Systems Security and Privacy Protection, Lili Nemec Zlatolas, Kai Rannenberg, Tatjana Welzer, and Joaquin Garcia-Alfaro (Eds.). Springer Nature Switzerland, Cham, 111–124
- [34] Maximilian Schreiber and Pascal Tippe. 2025. Security Vulnerabilities in AI-Generated Code: A Large-Scale Analysis of Public GitHub Repositories. Springer Nature Singapore, 153–172. doi:10.1007/978-981-95-3537-8_9
- [35] SciTools. 2026. Understand. https://scitools.com/ Accessed: 2026-03-26.
- [36] Mohammed Latif Siddiq, Joanna Cecilia da Silva Santos, Sajith Devareddy, and Anna Muller. 2024. SALLM: Security Assessment of Generated Code. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW '24). ACM, 54–65. doi:10.1145/3691621.3694934
- [37]
- [38] Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, and Toufique Ahmed. 2025. Calibration and Correctness of Language Models for Code. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (Ottawa, Ontario, Canada) (ICSE '25). IEEE Press, 540–552. doi:10....
- [39] Hyunjae Suh, Mahan Tafreshipour, Jiawei Li, Adithya Bhattiprolu, and Iftekhar Ahmed. 2025. An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We?. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (Ottawa, Ontario, Canada) (ICSE '25). IEEE Press, 859–871. doi:10.1109/ICSE55347.2025.00064
- [40] tree-sitter contributors. 2025. tree-sitter. https://github.com/tree-sitter/tree-sitter
- [41] Jiexin Wang, Xitong Luo, Liuwen Cao, Hongkui He, Hailin Huang, Jiayuan Xie, Adam Jatowt, and Yi Cai. 2024. Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval. arXiv:2407.02395 [cs.SE] https://arxiv.org/abs/2407.02395
- [42]
- [43] Tao Xiao, Youmei Fan, Fabio Calefato, Christoph Treude, Raula Gaikovina Kula, Hideaki Hata, and Sebastian Baltes. 2026. Self-Admitted GenAI Usage in Open-Source Software. arXiv:2507.10422 [cs.SE] https://arxiv.org/abs/2507.10422
- [44] Yuliang Xu, Siming Huang, Mingmeng Geng, Yao Wan, Xuanhua Shi, and Dongping Chen. 2026. code-transformed: The Influence of Large Language Models on Code. In Findings of the Association for Computational Linguistics: EACL 2026, Vera Demberg, Kentaro Inui, and Lluís Marquez (Eds.). Association for Computational Linguistics, Rabat, Morocco, 5462–5490. doi:1...
- [45] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pretrained Models. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 428–439. doi:10.1145/...
- [46] Beiqi Zhang, Peng Liang, Qiong Feng, Yujia Fu, and Zengyang Li. 2024. Copilot-in-the-Loop: Fixing Code Smells in Copilot-Generated Python Code using Copilot. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (Sacramento, CA, USA) (ASE '24). Association for Computing Machinery, New York, NY, USA, 2230–2234. doi:...