hub

A Survey on Large Language Models for Code Generation

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, Sunghun Kim · 2024 · cs.CL · arXiv 2406.00515

26 Pith papers cite this work. Polarity classification is still indexing.

26 Pith papers citing it

open full Pith review browse 26 citing papers arXiv PDF

abstract

Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks, known as Code LLMs, particularly in code generation that generates source code with LLM from natural language descriptions. This burgeoning field has captured significant interest from both academic researchers and industry professionals due to its practical significance in software development, e.g., GitHub Copilot. Despite the active exploration of LLMs for a variety of code tasks, either from the perspective of natural language processing (NLP) or software engineering (SE) or both, there is a noticeable absence of a comprehensive and up-to-date literature review dedicated to LLM for code generation. In this survey, we aim to bridge this gap by providing a systematic literature review that serves as a valuable reference for researchers investigating the cutting-edge progress in LLMs for code generation. We introduce a taxonomy to categorize and discuss the recent developments in LLMs for code generation, covering aspects such as data curation, latest advances, performance evaluation, ethical implications, environmental impact, and real-world applications. In addition, we present a historical overview of the evolution of LLMs for code generation and offer an empirical comparison using the HumanEval, MBPP, and BigCodeBench benchmarks across various levels of difficulty and types of programming tasks to highlight the progressive enhancements in LLM capabilities for code generation. We identify critical challenges and promising opportunities regarding the gap between academia and practical development. Furthermore, we have established a dedicated resource GitHub page (https://github.com/juyongjiang/CodeLLMSurvey) to continuously document and disseminate the most recent advances in the field.

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

Evaluating Non-English Developer Support in Machine Learning for Software Engineering

cs.SE · 2026-05-07 · unverdicted · novelty 7.0

Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.

BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design

cs.AI · 2026-04-14 · unverdicted · novelty 7.0

BEAM reformulates LLM-based heuristic design as bi-level optimization using GA for structures, MCTS for placeholders, and adaptive memory to outperform prior single-layer methods on CVRP and MIS tasks.

Evaluating LLMs Code Reasoning Under Real-World Context

cs.SE · 2026-04-14 · unverdicted · novelty 7.0

R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

cs.SE · 2026-04-12 · unverdicted · novelty 7.0

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.

IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.

Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios

cs.SE · 2026-04-08 · unverdicted · novelty 7.0

A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.

Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure

cs.SE · 2026-04-02 · accept · novelty 7.0

Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task difficulty predictions.

Automating Database-Native Function Code Synthesis with LLMs

cs.DB · 2026-04-02 · conditional · novelty 7.0

DBCooker automates synthesis of database native functions via LLM-guided characterization, coding plans, hybrid filling, and progressive validation, delivering 34.55% higher accuracy than baselines on SQLite, PostgreSQL, and DuckDB while generating functions absent from SQLite 3.50.

Tailored Prompts, Targeted Protection: Vulnerability-Specific LLM Analysis for Smart Contracts

cs.CR · 2026-05-05 · unverdicted · novelty 6.0

An LLM framework with tailored prompts and a new dataset of 31,165 annotated instances achieves 0.92 positive recall and 0.85 negative recall for detecting 13 smart contract vulnerability categories.

DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis

cs.CL · 2026-04-29 · unverdicted · novelty 6.0

DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.

Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment

cs.LG · 2026-04-27 · unverdicted · novelty 6.0

Meta-Aligner introduces a meta-learner network that produces dynamic preference weights to enable bidirectional optimization between preferences and LLM policy responses for multi-objective alignment.

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

cs.SE · 2026-04-24 · unverdicted · novelty 6.0

RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo generation on small projects versus module-by-module on complex ones.

A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair

cs.SE · 2026-04-23 · unverdicted · novelty 6.0

Metamorphic testing on Defects4J and GitBug-Java reveals substantial performance drops in seven LLMs that correlate with NLL, indicating data leakage in LLM-based program repair.

Understanding the Mechanism of Altruism in Large Language Models

econ.GN · 2026-04-21 · unverdicted · novelty 6.0

A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

Automating Structural Analysis Across Multiple Software Platforms Using Large Language Models

cs.SE · 2026-04-10 · unverdicted · novelty 6.0

A two-stage multi-agent LLM converts structural inputs to JSON then platform-specific scripts for ETABS, SAP2000, and OpenSees, achieving over 90% accuracy on 20 frame problems across ten trials.

Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution

cs.SE · 2026-04-07 · unverdicted · novelty 6.0

LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.

Compiling Code LLMs into Lightweight Executables

cs.SE · 2026-03-31 · conditional · novelty 6.0

Ditto quantizes Code LLMs with K-Means codebooks and compiles inference via LLVM-BLAS replacement to deliver up to 10.5x faster, 6.4x smaller, and 10.5x lower-energy execution on commodity hardware while losing only 0.27% pass@1 accuracy.

Video models are zero-shot learners and reasoners

cs.LG · 2025-09-24 · unverdicted · novelty 6.0

Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.

Mono2Sls: Automated Monolith-to-Serverless Migration via Multi-Stage Pipeline with Static Analysis

cs.SE · 2026-04-27 · unverdicted · novelty 5.0

Mono2Sls automates monolith-to-serverless migration with static analysis and multi-stage LLM agents, achieving 100% deployment success and 66.1% end-to-end correctness on six benchmarks.

HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

cs.DC · 2026-04-18 · unverdicted · novelty 5.0

HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

cs.CL · 2026-04-11 · unverdicted · novelty 5.0

APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

cs.CL · 2026-04-11 · unverdicted · novelty 5.0

FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

Like a Hammer, It Can Build, It Can Break: Large Language Model Uses, Perceptions, and Adoption in Cybersecurity Operations on Reddit

cs.CR · 2026-04-11 · unverdicted · novelty 5.0

Security practitioners use LLMs independently for low-risk productivity tasks while showing interest in enterprise platforms, but reliability, verification needs, and security risks limit broader autonomy.

citing papers explorer

Showing 26 of 26 citing papers.

Evaluating Non-English Developer Support in Machine Learning for Software Engineering cs.SE · 2026-05-07 · unverdicted · none · ref 97 · internal anchor
Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design cs.AI · 2026-04-14 · unverdicted · none · ref 4 · internal anchor
BEAM reformulates LLM-based heuristic design as bi-level optimization using GA for structures, MCTS for placeholders, and adaptive memory to outperform prior single-layer methods on CVRP and MIS tasks.
Evaluating LLMs Code Reasoning Under Real-World Context cs.SE · 2026-04-14 · unverdicted · none · ref 13 · internal anchor
R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search cs.SE · 2026-04-12 · unverdicted · none · ref 19 · internal anchor
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling cs.AI · 2026-04-09 · unverdicted · none · ref 35 · internal anchor
IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.
Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios cs.SE · 2026-04-08 · unverdicted · none · ref 11 · internal anchor
A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.
Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy cs.CL · 2026-04-03 · unverdicted · none · ref 29 · internal anchor
LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.
Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure cs.SE · 2026-04-02 · accept · none · ref 10 · internal anchor
Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task difficulty predictions.
Automating Database-Native Function Code Synthesis with LLMs cs.DB · 2026-04-02 · conditional · none · ref 29 · internal anchor
DBCooker automates synthesis of database native functions via LLM-guided characterization, coding plans, hybrid filling, and progressive validation, delivering 34.55% higher accuracy than baselines on SQLite, PostgreSQL, and DuckDB while generating functions absent from SQLite 3.50.
Tailored Prompts, Targeted Protection: Vulnerability-Specific LLM Analysis for Smart Contracts cs.CR · 2026-05-05 · unverdicted · none · ref 8 · internal anchor
An LLM framework with tailored prompts and a new dataset of 31,165 annotated instances achieves 0.92 positive recall and 0.85 negative recall for detecting 13 smart contract vulnerability categories.
DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis cs.CL · 2026-04-29 · unverdicted · none · ref 2 · internal anchor
DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.
Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment cs.LG · 2026-04-27 · unverdicted · none · ref 8 · internal anchor
Meta-Aligner introduces a meta-learner network that produces dynamic preference weights to enable bidirectional optimization between preferences and LLM policy responses for multi-objective alignment.
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices cs.SE · 2026-04-24 · unverdicted · none · ref 23 · internal anchor
RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo generation on small projects versus module-by-module on complex ones.
A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair cs.SE · 2026-04-23 · unverdicted · none · ref 39 · internal anchor
Metamorphic testing on Defects4J and GitBug-Java reveals substantial performance drops in seven LLMs that correlate with NLL, indicating data leakage in LLM-based program repair.
Understanding the Mechanism of Altruism in Large Language Models econ.GN · 2026-04-21 · unverdicted · none · ref 76 · internal anchor
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
Automating Structural Analysis Across Multiple Software Platforms Using Large Language Models cs.SE · 2026-04-10 · unverdicted · none · ref 9 · internal anchor
A two-stage multi-agent LLM converts structural inputs to JSON then platform-specific scripts for ETABS, SAP2000, and OpenSees, achieving over 90% accuracy on 20 frame problems across ten trials.
Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution cs.SE · 2026-04-07 · unverdicted · none · ref 18 · internal anchor
LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.
Compiling Code LLMs into Lightweight Executables cs.SE · 2026-03-31 · conditional · none · ref 30 · internal anchor
Ditto quantizes Code LLMs with K-Means codebooks and compiles inference via LLVM-BLAS replacement to deliver up to 10.5x faster, 6.4x smaller, and 10.5x lower-energy execution on commodity hardware while losing only 0.27% pass@1 accuracy.
Video models are zero-shot learners and reasoners cs.LG · 2025-09-24 · unverdicted · none · ref 1 · internal anchor
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
Mono2Sls: Automated Monolith-to-Serverless Migration via Multi-Stage Pipeline with Static Analysis cs.SE · 2026-04-27 · unverdicted · none · ref 25 · internal anchor
Mono2Sls automates monolith-to-serverless migration with static analysis and multi-stage LLM agents, achieving 100% deployment success and 66.1% end-to-end correctness on six benchmarks.
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention cs.DC · 2026-04-18 · unverdicted · none · ref 6 · internal anchor
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning cs.CL · 2026-04-11 · unverdicted · none · ref 50 · internal anchor
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs cs.CL · 2026-04-11 · unverdicted · none · ref 50 · internal anchor
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
Like a Hammer, It Can Build, It Can Break: Large Language Model Uses, Perceptions, and Adoption in Cybersecurity Operations on Reddit cs.CR · 2026-04-11 · unverdicted · none · ref 64 · internal anchor
Security practitioners use LLMs independently for low-risk productivity tasks while showing interest in enterprise platforms, but reliability, verification needs, and security risks limit broader autonomy.
Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition cs.SE · 2026-04-03 · conditional · none · ref 31 · internal anchor
Hybrid LLM plus static analysis for algorithm recognition in code cuts required model calls by 72-97% and lifts F1-scores by as much as 12 points.
OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research cs.SE · 2025-04-22 · unreviewed · ref 30 · internal anchor

A Survey on Large Language Models for Code Generation

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer