Recognition: 2 theorem links · Lean Theorem
A Survey on Large Language Models for Code Generation
Pith reviewed 2026-05-13 20:13 UTC · model grok-4.3
The pith
A survey organizes recent work on large language models that generate code from natural language into a taxonomy with benchmark comparisons.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a taxonomy to categorize and discuss the recent developments in LLMs for code generation, covering aspects such as data curation, latest advances, performance evaluation, ethical implications, environmental impact, and real-world applications. In addition, we present a historical overview of the evolution of LLMs for code generation and offer an empirical comparison using the HumanEval, MBPP, and BigCodeBench benchmarks across various levels of difficulty and types of programming tasks to highlight the progressive enhancements in LLM capabilities for code generation. We identify critical challenges and promising opportunities regarding the gap between academia and practical development.
What carries the argument
The taxonomy that groups LLM code-generation work by data curation, model advances, performance evaluation, ethics, environmental impact, and applications.
If this is right
- Newer models demonstrate clear gains on benchmarks of increasing difficulty.
- A persistent gap exists between academic results and requirements for practical software development.
- Ethical and environmental factors must be weighed together with accuracy when deploying these models.
- Shared resources allow the community to update the overview as new models appear.
Where Pith is reading between the lines
- The taxonomy could be reused as a template for surveys on related tasks such as code repair or explanation.
- Emphasis on environmental impact may encourage development of smaller or more efficient training methods.
- Benchmark trends point toward the need for harder, more realistic programming challenges to continue progress.
Load-bearing premise
The papers chosen and the taxonomy used to group them capture the full state of the field without major selection bias.
What would settle it
A set of recent publications on large language models for code generation that cannot be placed in any category of the taxonomy or that show benchmark trends opposite to the reported improvements.
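A mechanical version of that settling test: tag each recent paper with taxonomy categories and flag any that match none. In this sketch the category names are the ones stated in the survey's core claim, while the papers and their tags are hypothetical placeholders, not real annotations:

```python
# Taxonomy categories as stated in the survey's core claim.
TAXONOMY = {"data curation", "latest advances", "performance evaluation",
            "ethical implications", "environmental impact", "real-world applications"}

# Hypothetical tagged papers; real tags would come from manual annotation.
papers = {
    "paper A": {"data curation", "performance evaluation"},
    "paper B": {"latest advances"},
    "paper C": {"quantum debugging"},  # tag outside the taxonomy
}

# Any paper whose tags share nothing with the taxonomy is evidence
# against the taxonomy's completeness.
uncategorized = {name for name, tags in papers.items() if not tags & TAXONOMY}
print(sorted(uncategorized))  # → ['paper C']
```

A non-empty result would be exactly the kind of counterexample described above.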
Original abstract
Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks, known as Code LLMs, particularly in code generation that generates source code with LLM from natural language descriptions. This burgeoning field has captured significant interest from both academic researchers and industry professionals due to its practical significance in software development, e.g., GitHub Copilot. Despite the active exploration of LLMs for a variety of code tasks, either from the perspective of natural language processing (NLP) or software engineering (SE) or both, there is a noticeable absence of a comprehensive and up-to-date literature review dedicated to LLM for code generation. In this survey, we aim to bridge this gap by providing a systematic literature review that serves as a valuable reference for researchers investigating the cutting-edge progress in LLMs for code generation. We introduce a taxonomy to categorize and discuss the recent developments in LLMs for code generation, covering aspects such as data curation, latest advances, performance evaluation, ethical implications, environmental impact, and real-world applications. In addition, we present a historical overview of the evolution of LLMs for code generation and offer an empirical comparison using the HumanEval, MBPP, and BigCodeBench benchmarks across various levels of difficulty and types of programming tasks to highlight the progressive enhancements in LLM capabilities for code generation. We identify critical challenges and promising opportunities regarding the gap between academia and practical development. Furthermore, we have established a dedicated resource GitHub page (https://github.com/juyongjiang/CodeLLMSurvey) to continuously document and disseminate the most recent advances in the field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a survey on Large Language Models for code generation (Code LLMs). It introduces a taxonomy covering data curation, latest model advances, performance evaluation, ethical implications, environmental impact, and real-world applications; provides a historical overview of the field's evolution; presents empirical comparisons on HumanEval, MBPP, and BigCodeBench across difficulty levels and task types; identifies challenges and opportunities between academia and practice; and maintains a GitHub repository for ongoing updates.
Significance. If the literature coverage is comprehensive and unbiased, the survey would provide a valuable organizing reference for researchers at the NLP-SE intersection, with its multi-benchmark empirical section and explicit treatment of ethics/environmental factors offering practical utility beyond typical reviews. The living GitHub resource is a positive contribution for field currency.
major comments (1)
- [Introduction] Introduction and any dedicated methodology section: the manuscript positions itself as a 'systematic literature review' yet provides no reproducible protocol details (databases queried, exact search strings, date cutoffs, inclusion/exclusion criteria, or counts of papers screened versus included). This directly affects the defensibility of the taxonomy's completeness and the claim that it represents 'cutting-edge progress' without selection bias.
minor comments (2)
- [Taxonomy] Taxonomy section: ensure each top-level category (e.g., data curation) is accompanied by explicit criteria or examples of how papers were assigned, to improve reproducibility of the categorization.
- [Performance Evaluation] Benchmark comparison section: clarify whether the reported results use the exact same prompting setup and decoding parameters across models, or note any variations that could affect cross-model comparisons.
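The cross-model comparability concern can be made concrete: HumanEval-style benchmarks typically report pass@k, whose unbiased estimator (introduced alongside HumanEval by Chen et al., reference [48]) depends on the number of samples n drawn per problem and on the sampling setup, so any variation in decoding parameters across models shifts the estimate. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021):
    n samples drawn per problem, c of which pass the tests."""
    if n - c < k:
        return 1.0  # fewer than k failures: every size-k draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 30 correct, estimate pass@10.
print(f"pass@10 ≈ {pass_at_k(200, 30, 10):.3f}")
```

Because the estimate is a function of n and of the decoding temperature that produced the c correct samples, reports should state both whenever models are compared.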
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our survey. We address the major comment below and will incorporate the suggested changes in the revised manuscript.
Point-by-point responses
-
Referee: [Introduction] Introduction and any dedicated methodology section: the manuscript positions itself as a 'systematic literature review' yet provides no reproducible protocol details (databases queried, exact search strings, date cutoffs, inclusion/exclusion criteria, or counts of papers screened versus included). This directly affects the defensibility of the taxonomy's completeness and the claim that it represents 'cutting-edge progress' without selection bias.
Authors: We acknowledge this is a valid observation. Although the survey was conducted systematically, the original manuscript omitted an explicit methodology section. In the revision, we will add a dedicated Section 2 (Methodology) that details: (1) databases queried (arXiv, Google Scholar, ACL Anthology, IEEE Xplore, and DBLP); (2) exact search strings (e.g., ("large language model" OR "LLM") AND ("code generation" OR "code synthesis" OR "program synthesis")); (3) date cutoff (papers published or posted up to May 2024); (4) inclusion criteria (peer-reviewed or preprint works primarily addressing LLMs for code generation tasks, with empirical results or novel methods); (5) exclusion criteria (non-English papers, duplicates, surveys without new analysis, or works focused solely on non-generation code tasks); and (6) screening statistics (initial retrieval count, papers screened after title/abstract review, and final included papers). This addition will strengthen reproducibility and mitigate concerns about selection bias while preserving the taxonomy and empirical comparisons.
Revision: yes
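The screening protocol promised in this response can be sketched as a simple filter over candidate records; the record fields, field names, and example papers below are illustrative assumptions, not the authors' actual pipeline:

```python
from dataclasses import dataclass

# Hypothetical record shape for a screened candidate paper.
@dataclass
class Record:
    title: str
    language: str               # publication language, e.g. "en"
    is_duplicate: bool
    is_survey_without_analysis: bool
    addresses_code_generation: bool
    year: int

def passes_screening(r: Record, cutoff_year: int = 2024) -> bool:
    """Apply the inclusion/exclusion rules listed in the rebuttal."""
    if r.language != "en" or r.is_duplicate or r.is_survey_without_analysis:
        return False
    if not r.addresses_code_generation:
        return False
    return r.year <= cutoff_year

records = [
    Record("Codex eval", "en", False, False, True, 2021),
    Record("Duplicate entry", "en", True, False, True, 2023),
    Record("Survey without new analysis", "en", False, True, True, 2022),
]
kept = [r.title for r in records if passes_screening(r)]
print(kept)  # → ['Codex eval']
```

Publishing the filter alongside the screening counts would make the promised statistics directly reproducible.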
Circularity Check
No circularity in survey structure or claims
Full rationale
This paper is a systematic literature review that introduces a taxonomy for organizing external research on LLMs for code generation and summarizes cited works across data curation, advances, evaluation, ethics, impact, and applications. It contains no original derivations, equations, predictions, fitted parameters, or self-referential claims that could reduce to the paper's own inputs by construction. The central assertions rest on the selection and synthesis of prior literature rather than any internal mathematical or predictive chain, rendering circularity analysis inapplicable.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitabilitybilinear_family_forced · unclear · "In addition, we present a historical overview of the evolution of LLMs for code generation and offer an empirical comparison using the HumanEval, MBPP, and BigCodeBench benchmarks"
Forward citations
Cited by 26 Pith papers
-
Evaluating Non-English Developer Support in Machine Learning for Software Engineering
Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
-
BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design
BEAM reformulates LLM-based heuristic design as bi-level optimization using GA for structures, MCTS for placeholders, and adaptive memory to outperform prior single-layer methods on CVRP and MIS tasks.
-
Evaluating LLMs Code Reasoning Under Real-World Context
R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.
-
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
-
IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling
IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.
-
Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios
A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.
-
Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.
-
Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure
Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task diffic...
-
Automating Database-Native Function Code Synthesis with LLMs
DBCooker automates synthesis of database native functions via LLM-guided characterization, coding plans, hybrid filling, and progressive validation, delivering 34.55% higher accuracy than baselines on SQLite, PostgreS...
-
Tailored Prompts, Targeted Protection: Vulnerability-Specific LLM Analysis for Smart Contracts
An LLM framework with tailored prompts and a new dataset of 31,165 annotated instances achieves 0.92 positive recall and 0.85 negative recall for detecting 13 smart contract vulnerability categories.
-
DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis
DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.
-
Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment
Meta-Aligner introduces a meta-learner network that produces dynamic preference weights to enable bidirectional optimization between preferences and LLM policy responses for multi-objective alignment.
-
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo g...
-
A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair
Metamorphic testing on Defects4J and GitBug-Java reveals substantial performance drops in seven LLMs that correlate with NLL, indicating data leakage in LLM-based program repair.
-
Understanding the Mechanism of Altruism in Large Language Models
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
-
Automating Structural Analysis Across Multiple Software Platforms Using Large Language Models
A two-stage multi-agent LLM converts structural inputs to JSON then platform-specific scripts for ETABS, SAP2000, and OpenSees, achieving over 90% accuracy on 20 frame problems across ten trials.
-
Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.
-
Compiling Code LLMs into Lightweight Executables
Ditto quantizes Code LLMs with K-Means codebooks and compiles inference via LLVM-BLAS replacement to deliver up to 10.5x faster, 6.4x smaller, and 10.5x lower-energy execution on commodity hardware while losing only 0...
-
Video models are zero-shot learners and reasoners
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
-
Mono2Sls: Automated Monolith-to-Serverless Migration via Multi-Stage Pipeline with Static Analysis
Mono2Sls automates monolith-to-serverless migration with static analysis and multi-stage LLM agents, achieving 100% deployment success and 66.1% end-to-end correctness on six benchmarks.
-
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
-
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
-
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
-
Like a Hammer, It Can Build, It Can Break: Large Language Model Uses, Perceptions, and Adoption in Cybersecurity Operations on Reddit
Security practitioners use LLMs independently for low-risk productivity tasks while showing interest in enterprise platforms, but reliability, verification needs, and security risks limit broader autonomy.
-
Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition
Hybrid LLM plus static analysis for algorithm recognition in code cuts required model calls by 72-97% and lifts F1-scores by as much as 12 points.
-
OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research
Reference graph
Works this paper leans on
-
[1]
AgentGPT: Assemble, configure, and deploy autonomous AI Agents in your browser
2023. AgentGPT: Assemble, configure, and deploy autonomous AI Agents in your browser. https://github.com/reworkd/AgentGPT
work page 2023
-
[2]
AutoGPT is the vision of accessible AI for everyone, to use and to build on
2023. AutoGPT is the vision of accessible AI for everyone, to use and to build on. https://github.com/Significant-Gravitas/AutoGPT
work page 2023
-
[3]
2023. BabyAGI. https://github.com/yoheinakajima/babyagi
work page 2023
-
[4]
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A Transformer-based Approach for Source Code Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics . 4998–5007
work page 2020
- [7]
-
[8]
Ali Al-Kaswan, Maliheh Izadi, and Arie Van Deursen. 2024. Traces of memorisation in large language models for code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering . 1–12
work page 2024
- [9]
-
[10]
Miltiadis Allamanis and Charles Sutton. 2014. Mining idioms from source code. In Proceedings of the 22nd acm sigsoft international symposium on foundations of software engineering . 472–483
work page 2014
-
[11]
Google DeepMind AlphaCode Team. 2023. AlphaCode 2 Technical Report. https://storage.googleapis.com/deepmind- media/AlphaCode2/AlphaCode2_Tech_Report.pdf
work page 2023
-
[12]
Amazon. 2022. What is CodeWhisperer? https://docs.aws.amazon.com/codewhisperer/latest/userguide/what-is-cwspr.html
work page 2022
-
[13]
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Lo...
work page 2019
-
[14]
Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
work page 2024
-
[15]
Luca Ardito, Riccardo Coppola, Luca Barbato, and Diego Verga. 2020. A tool-based perspective on software code maintainability metrics: a systematic literature review. Scientific Programming 2020 (2020), 1–26
work page 2020
- [16]
-
[17]
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models.arXiv preprint arXiv:2108.07732 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[18]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
work page internal anchor Pith review Pith/arXiv arXiv 2016
- [19]
-
[20]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [22]
-
[23]
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 65–72
work page 2005
-
[24]
Enrico Barbierato, Marco L Della Vedova, Daniele Tessera, Daniele Toti, and Nicola Vanoli. 2022. A methodology for controlling bias and fairness in synthetic data generation. Applied Sciences 12, 9 (2022), 4619
work page 2022
-
[25]
Shraddha Barke, Michael B James, and Nadia Polikarpova. 2023. Grounded copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 85–111
work page 2023
-
[26]
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al . 2024. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [27]
-
[28]
Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou, Travis Lowdermilk, and Idan Gazit. 2022. Taking Flight with Copilot: Early insights and opportunities of AI-powered pair-programming tools. Queue 20, 6 (2022), 35–57
work page 2022
- [29]
-
[30]
Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow. https://doi.org/10.5281/zenodo.5297715
-
[31]
Veronika Bogina, Alan Hartman, Tsvi Kuflik, and Avital Shulner-Tal. 2022. Educating software and AI stakeholders about algorithmic fairness, accountability, transparency and ethics. International Journal of Artificial Intelligence in Education (2022), 1–26
work page 2022
-
[32]
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[33]
Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[34]
Raymond PL Buse and Westley R Weimer. 2009. Learning a metric for code readability. IEEE Transactions on software engineering 36, 4 (2009), 546–558
work page 2009
- [35]
-
[36]
Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. 2024. A survey on mixture of experts. arXiv preprint arXiv:2407.06204 (2024)
-
[37]
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21) . 2633–2650
work page 2021
-
[38]
Federico Cassano, John Gouwar, Francesca Lucchetti, Claire Schlesinger, Carolyn Jane Anderson, Michael Greenberg, Abhinav Jangda, and Arjun Guha. 2023. Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs. arXiv preprint arXiv:2308.09895 (2023)
-
[39]
Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al . 2022. A scalable and extensible approach to benchmarking nl2code for 18 programming languages. arXiv preprint arXiv:2208.08227 (2022)
- [40]
- [41]
-
[42]
Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15, 3 (2024), 1–45
work page 2024
-
[43]
Sahil Chaudhary. 2023. Code Alpaca: An Instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca
work page 2023
-
[44]
Binger Chen and Ziawasch Abedjan. 2023. DUETCS: Code Style Transfer through Generation and Retrieval. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) . IEEE, 2362–2373
work page 2023
- [45]
-
[46]
Fuxiang Chen, Fatemeh H Fard, David Lo, and Timofey Bryksin. 2022. On the transferability of pre-trained language models for low-resource programming languages. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension. 401–412
work page 2022
-
[47]
Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2024. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence , Vol. 38. 17754–17762
work page 2024
-
[48]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[49]
Stanley F Chen, Douglas Beeferman, and Roni Rosenfeld. 1998. Evaluation metrics for language models. (1998)
work page 1998
-
[50]
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. 2022. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[51]
Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
Xinyun Chen, Chang Liu, and Dawn Song. 2018. Tree-to-tree neural networks for program translation. Advances in neural information processing systems 31 (2018)
work page 2018
-
[53]
Andrew A Chien, Liuzixuan Lin, Hai Nguyen, Varsha Rao, Tristan Sharma, and Rajini Wijayawardana. 2023. Reducing the Carbon Impact of Generative AI Inference (today and in 2035). In Proceedings of the 2nd workshop on sustainable computer systems. 1–7
work page 2023
-
[54]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113
work page 2023
- [55]
-
[56]
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al . 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research 25, 70 (2024), 1–53
work page 2024
- [57]
-
[58]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[59]
CodeGemma Team, Ale Jakse Hartman, Andrea Hu, Christopher A. Choquette-Choo, Heri Zhao, Jane Fine, Jeffrey Hui, Jingyue Shen, Joe Kelley, Joshua Howland, Kshitij Bansal, Luke Vilnis, Mateo Wirth, Nam Nguyen, Paul Michel, et al.
work page
-
[60]
Codeium. 2023. Free, ultrafast Copilot alternative for Vim and Neovim. https://github.com/Exafunction/codeium.vim
work page 2023
-
[61]
Cognition. 2024. Introducing Devin, the first AI software engineer. https://www.cognition.ai/introducing-devin
work page 2024
-
[62]
Trevor Cohn, Phil Blunsom, and Sharon Goldwater. 2010. Inducing tree-substitution grammars. The Journal of Machine Learning Research 11 (2010), 3053–3096
work page 2010
-
[63]
Cognitive Computations. 2023. oa_leet10k. https://huggingface.co/datasets/cognitivecomputations/oa_leet10k
work page 2023
-
[64]
Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems . Springer, 337–340
work page 2008
-
[65]
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2024. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems 36 (2024)
work page 2024
-
[66]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
work page internal anchor Pith review Pith/arXiv arXiv 2018
- [67]
-
[68]
Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2024. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. Advances in Neural Information Processing Systems 36 (2024)
work page 2024
- [69]
-
[70]
Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [71]
-
[72]
Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating large language models in class-level code generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering . 1–13
work page 2024
-
[73]
Hugging Face. 2023. Training CodeParrot from Scratch. https://github.com/huggingface/blog/blob/main/codeparrot.md
work page 2023
-
[74]
Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for software engineering: Survey and open problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE) . IEEE, 31–53
work page 2023
-
[75]
Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) . IEEE, 1469–1481
work page 2023
-
[76]
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020)
work page internal anchor Pith review arXiv 2020
- [77]
-
[78]
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[79]
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Pal: Program-aided language models. In International Conference on Machine Learning . PMLR, 10764–10799
work page 2023
-
[80]
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023