A Survey on Code Generation with LLM-based Agents

Yihong Dong, Xue Jiang, Jiaru Qian, Tian Wang, Kechi Zhang, Zhi Jin · 2025 · cs.SE · arXiv 2508.00083

Canonical reference. 90% of citing Pith papers cite this work as background.

31 Pith papers citing it

Background: 90% of classified citations

abstract

Code generation agents powered by large language models (LLMs) are revolutionizing the software development paradigm. Distinct from previous code generation techniques, code generation agents are characterized by three core features. 1) Autonomy: the ability to independently manage the entire workflow, from task decomposition to coding and debugging. 2) Expanded task scope: capabilities that extend beyond generating code snippets to encompass the full software development lifecycle (SDLC). 3) Enhancement of engineering practicality: a shift in research emphasis from algorithmic innovation toward practical engineering challenges, such as system reliability, process management, and tool integration. This domain has recently witnessed rapid development and an explosion in research, demonstrating significant application potential. This paper presents a systematic survey of the field of LLM-based code generation agents. We trace the technology's developmental trajectory from its inception and systematically categorize its core techniques, including both single-agent and multi-agent architectures. Furthermore, this survey details the applications of LLM-based agents across the full SDLC, summarizes mainstream evaluation benchmarks and metrics, and catalogs representative tools. Finally, by analyzing the primary challenges, we identify and propose several foundational, long-term research directions for the future work of the field.

citation-role summary

background 10

citation-polarity summary

background 9 · unclear 1

representative citing papers

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

PRISM benchmark of over 10k pairs shows LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks

cs.CE · 2026-05-15 · unverdicted · novelty 7.0

BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

cs.SE · 2026-04-12 · unverdicted · novelty 7.0

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.

Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation

cs.SE · 2026-04-03 · unverdicted · novelty 7.0

Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.

Think Anywhere in Code Generation

cs.SE · 2026-03-31 · unverdicted · novelty 7.0

Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.

Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation

cs.SE · 2026-02-06 · conditional · novelty 7.0

SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.

CentaurEval: Benchmarking Human-in-the-Loop Value in Agentic Coding

cs.SE · 2025-11-30 · unverdicted · novelty 7.0 · 2 refs

Human-AI collaboration on CentaurEval's collaboration-necessary tasks reaches 31.11% success, far above standalone humans at 18.89% or LLMs at 0.67%.

CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

cs.SE · 2025-10-21 · conditional · novelty 7.0

CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reasoning and test-output tasks.

Self-Evolving Deep Research via Joint Generation and Evaluation

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

SCORE is a shared-parameter co-evolutionary framework coupling generation and evaluation of deep research reports with a meta-harness to adapt evaluation standards as performance improves.

AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development

cs.SE · 2026-05-04 · unverdicted · novelty 6.0

More capable LLMs and agents generate code with greater volume and architectural decay, following a Volume-Quality Inverse Law that neither functional correctness nor prompting mitigates.

QuantClaw: Precision Where It Matters for OpenClaw

cs.AI · 2026-04-24 · unverdicted · novelty 6.0

QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution

cs.SE · 2026-04-07 · unverdicted · novelty 6.0

LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.

ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness

cs.DC · 2026-02-14 · unverdicted · novelty 6.0

ACE-Bench is an execution-free benchmark that scores LLM coding agents on correct Azure SDK usage via deterministic regex checks and reference-based LLM judges derived from official documentation.

Large Language Model Agent for User-friendly Chemical Process Simulations

physics.chem-ph · 2026-01-15 · unverdicted · novelty 6.0

An LLM agent integrated with AVEVA Process Simulation via MCP enables natural language driven flowsheet analysis, optimization, and construction for chemical separation processes.

Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering

cs.SE · 2026-07-01 · unverdicted · novelty 5.0

A case study of AI-agentic software development yields a process model explaining how engineering judgment converts recurring structural failures into durable governance mechanisms.

Personalization as Inverse Planning: Learning Latent Design Intents for Agentic Slide Generation via Structural Denoising

cs.AI · 2026-07-01 · unverdicted · novelty 5.0

SPIRE approximates page-level slide personalization by training agents to denoise corrupted slide structures via collaborative RL, claiming a proof of consistency as a surrogate for inverse planning.

TacEvo: Self-Evolving Architecture Discovery for Robotic Tactile Perception via LLM-Driven Quality-Diversity Search

cs.RO · 2026-06-29 · unverdicted · novelty 5.0

TacEvo is an LLM-driven self-evolving search method that discovers neural architectures for robotic tactile force regression and grating classification, reporting fitness gains of 56.1% and 96.1% over 20 generations.

Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum

cs.RO · 2026-05-20 · unverdicted · novelty 5.0

A multi-agent LLM framework for humanoid loco-manipulation that separates active spatial perception and task planning from generalizable action generation without task-specific real-robot data.

Code as Agent Harness

cs.CL · 2026-05-18 · accept · novelty 5.0

A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.

Context Training with Active Information Seeking

cs.CL · 2026-05-13 · unverdicted · novelty 5.0 · 2 refs

Active information seeking via search tools, when combined with multi-candidate context pruning during training, produces consistent gains on translation, health, and reasoning tasks over naive tool addition or no-tool baselines.

