pith. machine review for the scientific record.

arxiv: 2605.05267 · v1 · submitted 2026-05-06 · 💻 cs.SE · cs.AI

Recognition: unknown

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

Bihuan Chen, Chong Wang, Kaifeng He, Kaifeng Huang, Mingwei Liu, Peiliang Cai, Xiaojun Zhang, Xin Peng, Yanlin Wang, Zibin Zheng

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:33 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLMs for code · systematic review · code generation · training data quality · code quality issues · causal framework · propagation mechanisms · quality assurance

The pith

Training data flaws propagate into LLM-generated code defects through 18 specific causal paths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This systematic review of 114 studies shows how imperfections in training corpora lead to defects such as bugs and vulnerabilities in code produced by large language models. It builds a taxonomy that groups generated code problems into nine dimensions and training data problems into code and non-code attributes, then maps 18 typical ways the data issues cause the output issues. The work matters because it reframes generation failures as traceable data problems rather than inherent model limits. It also surveys detection and mitigation methods and notes a shift toward fixing issues earlier through data governance instead of after-the-fact filtering.

Core claim

We establish a unified taxonomy that categorizes generated code quality issues across nine dimensions and training data quality issues into code and non-code attributes. Based on this taxonomy, we formalize a causal framework detailing 18 typical propagation mapping mechanisms. The reviewed literature reveals a clear methodological shift: quality assurance is transitioning from reactive, heuristic-based post-generation filtering toward proactive, data-centric governance and closed-loop repair.

What carries the argument

The causal framework that details 18 propagation mapping mechanisms linking training data quality issues to generated code quality issues.
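
If the framework holds, it is directly machinable: the 18 mappings form a bipartite graph that tooling can traverse in either direction, from a data issue to its likely defects or from an observed defect back to candidate root causes. A minimal Python sketch, using invented placeholder issue names rather than the paper's actual taxonomy labels (its real mappings are drawn in Figure 12):

```python
# Minimal sketch: the causal framework as a bipartite mapping. The issue
# names below are hypothetical placeholders, NOT the paper's taxonomy.
from collections import defaultdict

# training-data quality issue -> generated-code quality issues it can induce
PROPAGATION: dict[str, list[str]] = {
    "near-duplicate training code": ["repetitive output", "memorized snippets"],
    "vulnerable code in corpus":    ["security vulnerabilities"],
    "stale API usage in corpus":    ["deprecated-API calls", "compilation errors"],
    "misaligned code comments":     ["hallucinated behavior"],
}

# Invert the edges so an observed generation defect can be traced back to
# candidate root causes in the training data.
ROOT_CAUSES: dict[str, list[str]] = defaultdict(list)
for data_issue, code_issues in PROPAGATION.items():
    for code_issue in code_issues:
        ROOT_CAUSES[code_issue].append(data_issue)

print(ROOT_CAUSES["security vulnerabilities"])  # ['vulnerable code in corpus']
```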

If this is right

  • Detection and mitigation of quality issues can occur at the data curation stage, the model training stage, or during generation rather than only after output.
  • Quality assurance practices move from reactive filtering of bad outputs to proactive governance of training data and closed-loop repair systems (a minimal sketch of such a loop follows this list).
  • Integrated approaches combining data curation with continuous evaluation support the development of more reliable LLMs for code.
  • Open challenges remain in fully addressing all identified propagation paths and in scaling the taxonomy to new model architectures.
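
To make the closed-loop repair point concrete: generate, check, and feed the detected issues back into the next generation attempt rather than silently discarding bad outputs. Everything below is an illustrative assumption, not a mechanism taken from the paper; the syntax-only `detect` check stands in for the detection techniques (tests, SAST, linters) the review catalogs.

```python
# Hypothetical closed-loop repair cycle (a sketch, not the paper's method).
import ast
from typing import Callable

def detect(code: str) -> list[str]:
    """Cheapest possible check: syntactic validity."""
    try:
        ast.parse(code)
        return []
    except SyntaxError as e:
        return [f"syntax error: {e.msg} (line {e.lineno})"]

def closed_loop(generate: Callable[[list[str]], str], max_rounds: int = 3) -> str:
    """Regenerate with accumulated feedback until the checks pass."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        code = generate(feedback)   # a real loop would re-prompt the LLM here
        feedback = detect(code)
        if not feedback:
            return code
    raise RuntimeError(f"unresolved after {max_rounds} rounds: {feedback}")

# Toy usage: a fake "model" whose second draft fixes the syntax error.
drafts = iter(["def f(:\n    pass", "def f():\n    pass"])
print(closed_loop(lambda fb: next(drafts)))
```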

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy offers a checklist that practitioners could apply when auditing existing training datasets to flag risks before model training begins (a toy sketch follows this list).
  • The 18 mechanisms provide a basis for building predictive tools that estimate the likelihood of specific code defects given measurable properties of the training data.
  • The framework could generalize beyond code to other structured generation tasks where data quality directly affects output correctness.
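
A toy version of the audit-checklist idea from the first bullet, assuming a handful of deliberately crude heuristic proxies (exact-hash duplicates, license markers, secret patterns); a real audit would use the detection techniques the review surveys, such as clone detection, SAST, and license scanners.

```python
# Hypothetical corpus-audit heuristics; the flag names and regexes are
# illustrative proxies, not categories or rules from the paper.
import hashlib
import re

def audit_sample(code: str, seen_hashes: set[str]) -> list[str]:
    """Flag one corpus sample against a few checklist categories."""
    flags: list[str] = []
    digest = hashlib.sha256(code.encode()).hexdigest()
    if digest in seen_hashes:                              # exact-duplicate proxy
        flags.append("duplicate")
    seen_hashes.add(digest)
    if re.search(r"(?i)\bgpl\b|copyright", code):          # license-contamination proxy
        flags.append("license-marker")
    if re.search(r"(?i)api[_-]?key|password\s*=", code):   # secret-leak proxy
        flags.append("possible-secret")
    return flags

corpus = ['x = 1\n', 'x = 1\n', 'password = "hunter2"\n']
seen: set[str] = set()
for i, sample in enumerate(corpus):
    print(i, audit_sample(sample, seen))
# 0 []    1 ['duplicate']    2 ['possible-secret']
```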

Load-bearing premise

The 114 primary studies selected for the review represent, comprehensively and without significant bias, the mechanisms by which training data quality issues propagate to generated code quality issues.

What would settle it

A controlled study that produces LLM-generated code defects whose root causes cannot be mapped to any of the 18 propagation mechanisms or to the defined categories of training data quality issues.

Figures

Figures reproduced from arXiv: 2605.05267 by Bihuan Chen, Chong Wang, Kaifeng He, Kaifeng Huang, Mingwei Liu, Peiliang Cai, Xiaojun Zhang, Xin Peng, Yanlin Wang, Zibin Zheng.

Figure 1. LLM-generated code with multiple quality defects.
Figure 2. Training data quality issues propagated to generated code.
Figure 3. Conceptual framework of quality issues and mitigation in the LLM lifecycle.
Figure 4. Overview of the process of paper collection and filtering.
Figure 5. Cumulative number of included studies by publication period.
Figure 6. Distribution of included studies by quality score.
Figure 8. Taxonomy of generated code quality issues with corresponding literature references.
Figure 9. Temporal distribution of studies across nine quality dimensions (bubble size denotes the number of studies in the corresponding …).
Figure 10. Taxonomy of training data quality issues with corresponding literature references.
Figure 11. Temporal distribution of studies across code and non-code attribute quality dimensions (bubble size denotes the number of …).
Figure 12. A Sankey diagram illustrating the mappings from training data quality issues (left) to generated code quality issues (right).
Figure 13. Taxonomy of generated code quality issue detection techniques with corresponding literature references.
Figure 14. Taxonomy of training data quality issue detection techniques with corresponding literature references.
Figure 15. Taxonomy of generated code quality issue mitigation strategies with corresponding literature references.
Figure 16. Taxonomy of training data quality issue mitigation strategies with corresponding literature references.
Original abstract

Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empirical evidence increasingly traces their root causes to imperfections within the training corpora. Yet, the specific mechanisms linking training data quality issues to generated code quality issues remain largely unmapped. This paper presents a systematic literature review of 114 primary studies to investigate how training data quality issues propagate into code generation. We establish a unified taxonomy that categorizes generated code quality issues across nine dimensions and training data quality issues into code and non-code attributes. Based on this taxonomy, we formalize a causal framework detailing 18 typical propagation mapping mechanisms. Furthermore, we synthesize state-of-the-art detection and mitigation techniques across the data, model, and generation lifecycles. The reviewed literature reveals a clear methodological shift: quality assurance is transitioning from reactive, heuristic-based post-generation filtering toward proactive, data-centric governance and closed-loop repair. Finally, we identify open challenges and outline research directions for developing reliable LLMs for code through integrated data curation and continuous evaluation. Our repository is available at https://github.com/SYSUSELab/From-Data-to-Code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

1 major / 4 minor

Summary. This paper conducts a systematic literature review of 114 primary studies to examine how training data quality issues propagate into code generation quality issues in large language models. It constructs a unified taxonomy that organizes generated code quality issues along nine dimensions and training data quality issues into code and non-code attributes, then derives a causal framework consisting of 18 typical propagation mapping mechanisms. The review also synthesizes detection and mitigation techniques across the data, model, and generation stages, documents a shift toward proactive data-centric quality assurance, and identifies open challenges and research directions, with an accompanying public repository.

Significance. If the synthesized taxonomy and mappings accurately reflect the literature, the paper offers a useful organizational contribution that consolidates disparate findings on LLM code quality into a single framework. This can help researchers trace root causes from training data to generation failures and prioritize data governance over post-hoc fixes. The public repository supports reproducibility and further extension of the review.

major comments (1)
  1. [§3] §3 (Methodology): The review reports selecting 114 primary studies but provides only a high-level summary of the search strategy, inclusion criteria, and screening process. Explicit details on database queries, exact inclusion/exclusion rules, and any inter-rater reliability statistics are needed to evaluate selection bias and confirm that the taxonomy and 18 mappings comprehensively represent the literature without significant omissions.
minor comments (4)
  1. [§4.1] §4.1 and Figure 2: The nine dimensions of the generated code quality taxonomy are described narratively; an explicit summary table listing each dimension with a concise definition and one or two representative examples from the reviewed studies would improve clarity and usability.
  2. [§5] §5 (Causal framework): The 18 propagation mappings are labeled as 'causal' yet derived from observational literature; adding a short discussion of the correlational versus causal nature of the evidence and any noted confounding factors would strengthen the framework's interpretation.
  3. [§6] Repository and §6: The GitHub link is provided, but the paper should state precisely which artifacts (full study list, taxonomy codes, mapping details) are included and confirm they remain accessible.
  4. Throughout: Terminology is generally consistent, but ensure 'causal framework' is used uniformly with the qualifier 'typical propagation mapping mechanisms' to avoid implying primary causal inference.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of our systematic review and for the constructive suggestion to strengthen the methodological transparency. We agree that additional details will improve the paper's reproducibility and allow better evaluation of the taxonomy and mappings. We will incorporate the requested clarifications in the revised manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Methodology): The review reports selecting 114 primary studies but provides only a high-level summary of the search strategy, inclusion criteria, and screening process. Explicit details on database queries, exact inclusion/exclusion rules, and any inter-rater reliability statistics are needed to evaluate selection bias and confirm that the taxonomy and 18 mappings comprehensively represent the literature without significant omissions.

    Authors: We thank the referee for this observation. While §3 currently summarizes the overall search strategy, inclusion criteria, and screening process at a high level to preserve readability, we acknowledge that explicit details are required for rigorous assessment of selection bias and coverage. In the revised version, we will expand §3 to include: (1) the exact search strings and queries executed across each database (e.g., IEEE Xplore, ACM Digital Library, arXiv, Scopus); (2) the complete, itemized inclusion and exclusion criteria with justifications; (3) a detailed PRISMA-style flow diagram showing the number of papers at each screening stage; and (4) inter-rater reliability statistics (Cohen's kappa or equivalent) for the independent screening and data extraction phases performed by the authors. These additions will be placed in §3 and, where appropriate, referenced in the repository. We believe this will fully address the concern while maintaining the paper's focus on the synthesized taxonomies and propagation mechanisms. revision: yes
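
For reference, the Cohen's kappa statistic the authors promise is straightforward to compute over paired include/exclude screening decisions: kappa = (p_o − p_e) / (1 − p_e), observed agreement corrected for chance agreement. A self-contained sketch with made-up labels:

```python
# Cohen's kappa for two screeners' study-selection decisions.
# The label lists below are fabricated purely for illustration.
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    assert len(a) == len(b) and a
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))  # chance
    return (p_o - p_e) / (1 - p_e)  # undefined in the degenerate p_e == 1 case

r1 = ["include", "include", "exclude", "exclude", "include"]
r2 = ["include", "exclude", "exclude", "exclude", "include"]
print(round(cohens_kappa(r1, r2), 3))  # 0.615 on this toy data
```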

Circularity Check

0 steps flagged

No significant circularity

Full rationale

This paper is a systematic literature review that synthesizes external primary studies (114 papers) to construct a taxonomy of code quality issues and a causal framework of propagation mappings. No internal equations, fitted parameters, self-definitional loops, or load-bearing self-citations exist; the taxonomy and 18 mappings are presented as observed patterns from the reviewed literature rather than derived by construction from the paper's own inputs. The methodology is standard for reviews and remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is a literature synthesis paper; it introduces no new fitted parameters, mathematical axioms, or postulated entities beyond standard systematic review practices.

axioms (1)
  • domain assumption The 114 primary studies form a representative sample of research on training data and generated code quality issues in LLMs
    The taxonomy and 18 mappings rest on the completeness and lack of bias in the selected studies.

pith-pipeline@v0.9.0 · 5546 in / 1245 out tokens · 52399 ms · 2026-05-08T17:33:27.406431+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

156 extracted references · 141 canonical work pages · 22 internal anchors

  1. [1]

    Altaf Allah Abbassi, Leuson Da Silva, Amin Nikanjam, and Foutse Khomh. 2025. A Taxonomy of Inefficiencies in LLM-Generated Python Code. arXiv:2503.06327 [cs.SE] https://arxiv.org/abs/2503.06327

  2. [2]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu,...

  3. [3]

    Vibhor Agarwal, Yulong Pei, Salwa Alamir, and Xiaomo Liu. 2025. CodeMirage: Hallucinations in Code Generated by Large Language Models. arXiv:2408.08333 [cs.SE] https://arxiv.org/abs/2408.08333

  4. [4]

    Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. arXiv:1709.06182 [cs.SE] https://arxiv.org/abs/1709.06182

  5. [5]

    Victor Alves, Carla Bezerra, Ivan Machado, Larissa Rocha, Tássio Virgínio, and Publio Silva. 2025. Quality Assessment of Python Tests Generated by Large Language Models. arXiv:2506.14297 [cs.SE] https://arxiv.org/abs/2506.14297

  6. [6]

    Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024. To Code, or Not To Code? Exploring Impact of Code in Pre-training. arXiv:2408.10914 [cs.CL] https://arxiv.org/abs/2408.10914

  7. [7]

    Owura Asare, Meiyappan Nagappan, and N. Asokan. 2024. Is GitHub’s Copilot as Bad as Humans at Introducing Vulnerabilities in Code? arXiv:2204.04741 [cs.SE] https://arxiv.org/abs/2204.04741

  8. [8]

    Md. Abdul Awal, Mrigank Rochan, and Chanchal K. Roy. 2025. Large Language Models as Robust Data Generators in Software Analytics: Are We There Yet? arXiv:2411.10565 [cs.SE] https://arxiv.org/abs/2411.10565

  9. [9]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

  10. [10]

    Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, Sasha Frolov, Ravi Prakash Giri, Dhaval Kapil, Yiannis Kozyrakis, David LeBlanc, James Milazzo, Aleksandar Straumann, Gabriel Synnaeve, Varun Vontimitta, Spencer Whitman, and Joshua Saxe. 2023. Pu...

  11. [11]

    Yaoyao Chang, Lei Cui, Li Dong, Shaohan Huang, Yangyu Huang, Yupan Huang, Scarlett Li, Tengchao Lv, Shuming Ma, Qinzheng Sun, Wenhui Wang, Furu Wei, Ying Xin, Mao Yang, Qiufeng Yin, and Xingxing Zhang. 2024. RedStone: Curating General, Code, Math, and QA Data for Large Language Models. arXiv:2412.03398 [cs.CL] https://arxiv.org/abs/2412.03398

  12. [12]

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15, 3 (2024), 1–45.

  13. [13]

    Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, Wei Ye, and Shikun Zhang. 2025. A Survey on Evaluating Large Language Models in Code Generation Tasks. arXiv:2408.16498 [cs.SE] https://arxiv.org/abs/2408.16498

  14. [14]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  15. [15]

    Wanyi Chen, Meng-Wen Su, and Mary L. Cummings. 2025. Assessing LLM code generation quality through path planning tasks. arXiv:2504.21276 [cs.SE] https://arxiv.org/abs/2504.21276

  16. [16]

    Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, and Jingang Wang. 2025. Revisiting Scaling Laws for Language Models: The Role of Data Quality and Training Strategies. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekateri...

  17. [17]

    Heejae Chon, Seonghyeon Lee, Jinyoung Yeo, and Dongha Lee. 2024. Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes. arXiv:2408.14504 [cs.SE] https://arxiv.org/abs/2408.14504

  18. [18]

    Chun Jie Chong, Zhihao Yao, and Iulian Neamtiu. 2024. Artificial-Intelligence Generated Code Considered Harmful: A Road Map for Secure and High-Quality Code Generation. arXiv:2409.19182 [cs.CR] https://arxiv.org/abs/2409.19182

  19. [19]

    Codefuse, Ling Team, :, Wenting Cai, Yuchen Cao, Chaoyu Chen, Chen Chen, Siba Chen, Qing Cui, Peng Di, Junpeng Fang, Zi Gong, Ting Guo, Zhengyu He, Yang Huang, Cong Li, Jianguo Li, Zheng Li, Shijie Lian, BingChang Liu, Songshan Luo, Shuo Mao, Min Shen, Jian Wu, Jiaolong Yang, Wenjie Yang, Tong Ye, Hang Yu, Wei Zhang, Zhenduo Zhang, Hailin Zhao, Xunjin Zhe...

  20. [20]

    Domenico Cotroneo, Roberta De Luca, and Pietro Liguori. 2024. DeVAIC: A Tool for Security Assessment of AI-generated Code. arXiv:2404.07548 [cs.SE] https://arxiv.org/abs/2404.07548

  21. [21]

    Yihong Dong, Yuchen Liu, Xue Jiang, Zhi Jin, and Ge Li. 2025. Rethinking Repetition Problems of LLMs in Code Generation. arXiv:2505.10402 [cs.CL] https://arxiv.org/abs/2505.10402

  22. [22]

    Mingzhe Du, Anh Tuan Luu, Bin Ji, Qian Liu, and See-Kiong Ng. 2024. Mercury: A Code Efficiency Benchmark for Code Large Language Models. arXiv:2402.07844 [cs.SE] https://arxiv.org/abs/2402.07844

  23. [23]

    Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating Large Language Models in Class-Level Code Generation. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). 982–994. https://doi.org/10.1145/3597503.3639219

  24. [24]

    Yongkang Du, Jen tse Huang, Jieyu Zhao, and Lu Lin. 2025. FairCoder: Evaluating Social Bias of LLMs in Code Generation. arXiv:2501.05396 [cs.CL] https://arxiv.org/abs/2501.05396

  25. [25]

    Guoliang Duan, Mingwei Liu, Yanlin Wang, Chong Wang, Xin Peng, and Zibin Zheng. 2025. A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback. arXiv:2507.00699 [cs.SE] https://arxiv.org/abs/2507.00699

  26. [26]

    Maria Dziuba and Valentin Malykh. 2025. CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement. arXiv:2505.19757 [cs.SE] https://arxiv.org/abs/2505.19757

  27. [27]

    Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1469–1481.

  28. [28]

    Kazuki Fujii, Yukito Tajima, Sakae Mizuki, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Masanari Ohi, Masaki Kawamura, Taishi Nakamura, Takumi Okamoto, Shigeki Ishida, Kakeru Hattori, Youmi Ma, Hiroya Takamura, Rio Yokota, and Naoaki Okazaki. 2025. Rewriting Pre-Training Data Boosts LLM Performance in Math and Code. arXiv:2505.02881 [cs.LG] https://arx...

  29. [29]

    Cuiyun Gao, Xing Hu, Shan Gao, Xin Xia, and Zhi Jin. 2024. The Current Challenges of Software Engineering in the Era of Large Language Models. arXiv:2412.14554 [cs.SE] https://arxiv.org/abs/2412.14554

  30. [30]

    Jiahui Geng, Fengyu Cai, Shaobo Cui, Qing Li, Liangwei Chen, Chenyang Lyu, Haonan Li, Derui Zhu, Walter Pretschner, Heinz Koeppl, and Fakhri Karray. 2025. CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval. arXiv:2506.11066 [cs.SE] https://arxiv.org/abs/2506.11066

  31. [31]

    Mingyang Geng, Shangwen Wang, Dezun Dong, Haotian Wang, Ge Li, Zhi Jin, Xiaoguang Mao, and Xiangke Liao. 2024. Large language models are few-shot summarizers: Multi-intent comment generation via in-context learning. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13.

  32. [32]

    Alex Gu, Wen-Ding Li, Naman Jain, Theo X. Olausson, Celine Lee, Koushik Sen, and Armando Solar-Lezama. 2024. The Counterfeit Conundrum: Can Code Language Models Grasp the Nuances of Their Incorrect Generations? arXiv:2402.19475 [cs.SE] https://arxiv.org/abs/2402.19475

  33. [33]

    Wenchao Gu, Juntao Chen, Yanlin Wang, Tianyue Jiang, Xingzhe Li, Mingwei Liu, Xilin Liu, Yuchi Ma, and Zibin Zheng. 2025. What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond. arXiv:2503.20589 [cs.SE] https://arxiv.org/abs/2503.20589

  34. [34]

    Batu Guan, Yao Wan, Zhangqian Bi, Zheng Wang, Hongyu Zhang, Pan Zhou, and Lichao Sun. 2024. CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code. arXiv:2404.15639 [cs.CL] https://arxiv.org/abs/2404.15639

  35. [35]

    Ningxin Gui, Qianghuai Jia, Feijun Jiang, Yuling Jiao, dechun wang, and Jerry Zhijian Yang. 2025. CRPE: Expanding The Reasoning Capability of Large Language Model for Code Generation. arXiv:2505.10594 [cs.SE] https://arxiv.org/abs/2505.10594

  36. [36]

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. 2023. Textbooks Are All You Need. arXiv:2306.11644 [...

  37. [37]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv:2401.14196 [cs.SE] https://arxiv.org/abs/2401.14196

  38. [38]

    Jingxuan He and Martin Vechev. 2023. Large Language Models for Code: Security Hardening and Adversarial Testing. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (Copenhagen, Denmark) (CCS ’23). Association for Computing Machinery, New York, NY, USA, 1865–1879. https://doi.org/10.1145/3576915.3623175

  39. [39]

    Kaifeng He, Mingwei Liu, Chong Wang, Zike Li, Yanlin Wang, Xin Peng, and Zibin Zheng. 2026. Towards Better Code Generation: Adaptive Decoding with Uncertainty Guidance. arXiv:2506.08980 [cs.SE] https://arxiv.org/abs/2506.08980

  40. [40]

    Tianxing He, Jingzhao Zhang, Zhiming Zhou, and James Glass. 2021. Exposure Bias versus Self-Recovery: Are Distortions Really Incremental for Autoregressive Text Generation? arXiv:1905.10617 [cs.LG] https://arxiv.org/abs/1905.10617

  41. [41]

    Md Sifat Hossain, Anika Tabassum, Md. Fahim Arefin, and Tarannum Shaila Zaman. 2025. LLM-ProS: Analyzing Large Language Models’ Performance in Competitive Problem Solving. In 2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code). IEEE, 80–87. https://doi.org/10.1109/llm4code66737.2025.00015

  42. [42]

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 33, 8 (2024), 1–79.

  43. [43]

    Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, and Jie M. Zhang. 2025. EffiBench: Benchmarking the Efficiency of Automatically Generated Code. arXiv:2402.02037 [cs.SE] https://arxiv.org/abs/2402.02037

  44. [44]

    Nam Huynh and Beiyu Lin. 2025. Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications. arXiv:2503.01245 [cs.SE] https://arxiv.org/abs/2503.01245

  45. [45]

    Cristina Improta, Rosalia Tufano, Pietro Liguori, Domenico Cotroneo, and Gabriele Bavota. 2025. Quality In, Quality Out: Investigating Training Data’s Role in AI Code Generation. arXiv:2503.11402 [cs.SE] https://arxiv.org/abs/2503.11402

  46. [46]

    International Organization for Standardization. 2023. Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Product quality model. ISO/IEC 25010:2023

  47. [47]

    Maliheh Izadi, Jonathan Katzy, Tim Van Dam, Marc Otten, Razvan Mihai Popescu, and Arie Van Deursen. 2024. Language models for code completion: A practical evaluation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13.

  48. [48]

    Mahmoud Jahanshahi and Audris Mockus. 2025. Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets. In 2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code). IEEE, 104–111. https://doi.org/10.1109/llm4code66737.2025.00018

  49. [49]

    Nihal Jain, Robert Kwiatkowski, Baishakhi Ray, Murali Krishna Ramanathan, and Varun Kumar. 2024. On Mitigating Code LLM Hallucinations with API Documentation. arXiv:2407.09726 [cs.CL] https://arxiv.org/abs/2407.09726

  50. [50]

    Kevin Jesse, Toufique Ahmed, Premkumar T. Devanbu, and Emily Morgan. 2023. Large Language Models and Simple, Stupid Bugs. arXiv:2303.11455 [cs.SE] https://arxiv.org/abs/2303.11455

  51. [51]

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2026. A Survey on Large Language Models for Code Generation. ACM Transactions on Software Engineering and Methodology 35, 2 (Jan. 2026), 1–72. https://doi.org/10.1145/3747588

  52. [52]

    Siyuan Jiang, Jia Li, He Zong, Huanyu Liu, Hao Zhu, Shukai Hu, Erlu Li, Jiazheng Ding, Yu Han, Wei Ning, Gen Wang, Yihong Dong, Kechi Zhang, and Ge Li. 2025. aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing. arXiv:2410.13187 [cs.CL] https://arxiv.org/abs/2410.13187

  53. [53]

    Weipeng Jiang, Xuanqi Gao, Juan Zhai, Shiqing Ma, Xiaoyu Zhang, Ziyan Lei, and Chao Shen. 2025. From Effectiveness to Efficiency: Uncovering Linguistic Bias in Large Language Model-based Code Generation. arXiv:2406.00602 [cs.SE] https://arxiv.org/abs/2406.00602

  54. [54]

    Mohammed Kharma, Soohyeon Choi, Mohammed AlKhanafseh, and David Mohaisen. 2025. Security and Quality in LLM-Generated Code: A Multi-Language, Multi-Model Analysis. arXiv:2502.01853 [cs.CR] https://arxiv.org/abs/2502.01853

  55. [55]

    Kisub Kim, Jounghoon Kim, Byeongjo Park, Dongsun Kim, Chun Yong Chong, Yuan Wang, Tiezhu Sun, Daniel Tang, Jacques Klein, and Tegawende F. Bissyande. 2024. DataRecipe — How to Cook the Data for CodeLLM?. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (Sacramento, CA, USA) (ASE ’24). Association for Computing Ma...

  56. [56]

    Barbara Kitchenham and Stuart Charters. 2007. Guidelines for performing systematic literature reviews in software engineering. Technical Report. Keele University and Durham University Joint Report, EBSE 2007-001.

  57. [57]

    Jasmine Latendresse, SayedHassan Khatoonabadi, Ahmad Abdellatif, and Emad Shihab. 2024. Is ChatGPT a Good Software Librarian? An Exploratory Study on the Use of ChatGPT for Software Library Recommendations. arXiv:2408.05128 [cs.SE] https://arxiv.org/abs/2408.05128

  58. [58]

    Triet HM Le, Hao Chen, and Muhammad Ali Babar. 2020. Deep learning for source code modeling and generation: Models, applications, and challenges. ACM Computing Surveys (CSUR) 53, 3 (2020), 1–38.

  59. [59]

    Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. Deduplicating Training Data Makes Language Models Better. arXiv:2107.06499 [cs.CL] https://arxiv.org/abs/2107.06499

  60. [60]

    Huayang Li, Tian Lan, Zihao Fu, Deng Cai, Lemao Liu, Nigel Collier, Taro Watanabe, and Yixuan Su. 2023. Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective. arXiv:2310.10226 [cs.CL] https://arxiv.org/abs/2310.10226

  61. [61]

    DataComp-LM: In search of the next generation of training sets for language models

    Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardn...

  62. [62]

    Jia Li, Zeyang Zhuang, Zhuangbin Chen, Yuxin Su, Wei Meng, and Michael R. Lyu. 2026. ComBench: A Repo-level Real-world Benchmark for Compilation Error Repair. arXiv:2603.27333 [cs.SE] https://arxiv.org/abs/2603.27333

  63. [63]

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Loge...

  64. [64]

    Yuanheng Li, Zhuoyang Chen, Xiaoyun Liu, Yuhao Wang, Mingwei Liu, Yang Shi, Kaifeng Huang, and Shengjie Zhao. 2025. Uncovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach. arXiv:2511.07033 [cs.CR] https://arxiv.org/abs/2511.07033

  65. [65]

    Zike Li, Mingwei Liu, Anji Li, Kaifeng He, Yanlin Wang, Xin Peng, and Zibin Zheng. 2025. A Preliminary Study on the Robustness of Code Generation by Large Language Models. arXiv:2503.20197 [cs.SE] https://arxiv.org/abs/2503.20197

  66. [66]

    Xiaoli Lian, Shuaisong Wang, Jieping Ma, Xin Tan, Fang Liu, Lin Shi, Cuiyun Gao, and Li Zhang. 2024. Imperfect Code Generation: Uncovering Weaknesses in Automatic Code Generation by Large Language Models. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (Lisbon, Portugal) (ICSE-Companion ’24). A...

  67. [67]

    Linxi Liang, Jing Gong, Mingwei Liu, Chong Wang, Guangsheng Ou, Yanlin Wang, Xin Peng, and Zibin Zheng. 2025. RustEvo ˆ2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation. arXiv:2503.16922 [cs.SE] https://arxiv.org/abs/2503.16922

  68. [68]

    Yalan Lin, Chengcheng Wan, Yixiong Fang, and Xiaodong Gu. 2024. CodeCipher: Learning to Obfuscate Source Code Against LLMs. arXiv:2410.05797 [cs.CL] https://arxiv.org/abs/2410.05797

  69. [69]

    Lin Ling, Fazle Rabbi, Song Wang, and Jinqiu Yang. 2025. Bias Unveiled: Investigating Social Bias in LLM-Generated Code. arXiv:2411.10351 [cs.SE] https://arxiv.org/abs/2411.10351

  70. [70]

    Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. 2024. Exploring and Evaluating Hallucinations in LLM-Powered Code Generation. arXiv:2404.00971 [cs.SE] https://arxiv.org/abs/2404.00971

  71. [71]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv:2305.01210 [cs.SE] https://arxiv.org/abs/2305.01210

  72. [72]

    Mingwei Liu, Juntao Li, Ying Wang, Xueying Du, Zuoyu Ou, Qiuyuan Chen, Bingxu An, Zhao Wei, Yong Xu, Fangming Zou, Xin Peng, and Yiling Lou

  73. [73]

    Code Copycat Conundrum: Demystifying Repetition in LLM-based Code Generation. arXiv:2504.12608 [cs.SE] https://arxiv.org/abs/2504.12608

  74. [74]

    Mingwei Liu, Zheng Pei, Yanlin Wang, Zihao Wang, Zikang Li, Enci Lin, Xin Peng, and Zibin Zheng. 2025. Framework-Aware Code Generation with API Knowledge Graph-Constructed Data: A Study on HarmonyOS. arXiv:2512.00380 [cs.SE] https://arxiv.org/abs/2512.00380

  75. [75]

    Sicong Liu, Yanxian Huang, Mingwei Liu, Jiachi Chen, Ensheng Shi, Yuchi Ma, Hongyu Zhang, Yin Zhang, and Yanlin Wang. 2026. ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation. arXiv:2601.09703 [cs.SE] https://arxiv.org/abs/2601.09703

  76. [76]

    Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin. 2024. Datasets for Large Language Models: A Comprehensive Survey. arXiv:2402.18041 [cs.CL] https://arxiv.org/abs/2402.18041

  77. [77]

    Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, and David Lo. 2023. Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues. arXiv:2307.12596 [cs.SE] https://arxiv.org/abs/2307.12596

  78. [78]

    Zhijie Liu, Yutian Tang, Xiapu Luo, Yuming Zhou, and Liang Feng Zhang. 2024. No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT. arXiv:2308.04838 [cs.SE] https://arxiv.org/abs/2308.04838

  79. [79]

    Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Z...

  80. [80]

    Junyu Luo, Bohan Wu, Xiao Luo, Zhiping Xiao, Yiqiao Jin, Rong-Cheng Tu, Nan Yin, Yifan Wang, Jingyang Yuan, Wei Ju, and Ming Zhang. 2025. A Survey on Efficient Large Language Model Training: From Data-centric Perspectives. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Com...

Showing first 80 references.