A Survey on Code Generation with LLM-based Agents
Pith reviewed 2026-05-19 22:58 UTC · model grok-4.3
The pith
LLM-based code generation agents manage entire software projects autonomously from task breakdown through debugging and deployment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-based code generation agents are defined by three distinguishing traits: autonomy that lets them oversee complete workflows without constant human direction, an expanded scope that reaches the full software development lifecycle rather than single functions or modules, and a practical engineering focus that stresses system reliability, process coordination, and integration with development tools over pure algorithmic advances.
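The autonomy trait can be illustrated with a small sketch of a plan, code, test, and debug loop. This is a hypothetical illustration, not the method of any surveyed system: `Model`, `decompose`, `run_tests`, `solve`, and the `PLAN:`/`CODE:`/`FIX:` prompt prefixes are names invented here, and `scripted_model` is a canned stand-in for an LLM so the example runs offline.

```python
from typing import Callable, List

# Hypothetical stand-in for an LLM call: maps a prompt to generated text.
Model = Callable[[str], str]


def decompose(task: str, model: Model) -> List[str]:
    """Ask the model for subtasks and return them, one per non-empty line."""
    reply = model(f"PLAN: {task}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def run_tests(code: str, test: str) -> str:
    """Execute the code and then the test; return '' on success, else the error."""
    scope: dict = {}
    try:
        exec(code, scope)
        exec(test, scope)
        return ""
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"


def solve(task: str, test: str, model: Model, max_rounds: int = 3) -> str:
    """Plan, draft code, then repair it from test feedback until it passes."""
    plan = decompose(task, model)
    code = model("CODE: " + "; ".join(plan))
    for _ in range(max_rounds):
        error = run_tests(code, test)
        if not error:
            return code
        # Feed the failure back to the model and ask for a repaired version.
        code = model(f"FIX: {error}\n{code}")
    raise RuntimeError("no passing solution within the round budget")


def scripted_model(prompt: str) -> str:
    """Canned replies (assumption): a buggy first draft, then a correct fix."""
    if prompt.startswith("PLAN:"):
        return "define add(a, b)\nreturn the sum"
    if prompt.startswith("CODE:"):
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"


final = solve("add two numbers", "assert add(2, 3) == 5", scripted_model)
print(final)
```

The point of the sketch is that no human intervenes between the failing first draft and the repaired version; the loop itself carries the task from decomposition to a passing test.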
What carries the argument
The argument rests on three core features: autonomy, expanded task scope across the software development lifecycle, and enhanced engineering practicality. The survey uses them to classify single-agent versus multi-agent architectures and to structure its review of applications, benchmarks, and tools.
If this is right
- Agents are applied across every phase of the software development lifecycle rather than only code writing.
- Research attention shifts from new generation algorithms toward reliability, process management, and tool integration.
- Evaluation moves beyond isolated code correctness to end-to-end project success measured by new benchmarks and metrics.
- Multi-agent systems allow specialized roles and collaboration to tackle larger, more complex development tasks.
Where Pith is reading between the lines
- Successful maturation of these agents would likely change how human developers spend their time, moving emphasis from routine coding to specification, oversight, and integration decisions.
- The single-versus-multi-agent split may become less sharp as hybrid designs that combine both styles appear in real systems.
- If the proposed research directions are pursued, non-experts could gain practical ways to build and maintain software with minimal manual coding.
Load-bearing premise
The survey assumes that the rapidly expanding literature can be cleanly and comprehensively sorted into single-agent and multi-agent categories with no major omissions or alternative groupings that would change the overall picture.
What would settle it
Discovery of several widely cited, high-impact papers on LLM code generation whose architectures or workflows resist placement in either the single-agent or multi-agent category would show that the chosen organizational frame leaves out significant work.
Original abstract
Code generation agents powered by large language models (LLMs) are revolutionizing the software development paradigm. Distinct from previous code generation techniques, code generation agents are characterized by three core features. 1) Autonomy: the ability to independently manage the entire workflow, from task decomposition to coding and debugging. 2) Expanded task scope: capabilities that extend beyond generating code snippets to encompass the full software development lifecycle (SDLC). 3) Enhancement of engineering practicality: a shift in research emphasis from algorithmic innovation toward practical engineering challenges, such as system reliability, process management, and tool integration. This domain has recently witnessed rapid development and an explosion in research, demonstrating significant application potential. This paper presents a systematic survey of the field of LLM-based code generation agents. We trace the technology's developmental trajectory from its inception and systematically categorize its core techniques, including both single-agent and multi-agent architectures. Furthermore, this survey details the applications of LLM-based agents across the full SDLC, summarizes mainstream evaluation benchmarks and metrics, and catalogs representative tools. Finally, by analyzing the primary challenges, we identify and propose several foundational, long-term research directions for the future work of the field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys LLM-based code generation agents, claiming they are distinguished from prior techniques by three core features: autonomy to independently manage the full workflow from task decomposition to debugging; expanded scope across the entire software development lifecycle rather than isolated code snippets; and a shift toward engineering practicality including reliability, process management, and tool integration. It traces the developmental trajectory, categorizes core techniques into single-agent and multi-agent architectures, details applications across the SDLC, summarizes benchmarks/metrics and representative tools, and proposes long-term research directions based on identified challenges.
Significance. If the taxonomy and coverage hold, the survey would provide a useful organizing framework for a rapidly expanding subfield at the intersection of LLMs and software engineering, helping researchers identify patterns in agent architectures and gaps in practical deployment. The explicit focus on engineering challenges rather than pure algorithmic novelty is a constructive framing that aligns with industry needs.
major comments (2)
- [Abstract and §1] Abstract and §1 (Introduction): The central claim that the three features (autonomy, expanded SDLC scope, and engineering practicality) distinctly characterize LLM-based agents is load-bearing for the entire survey structure, yet the text provides no explicit contrast with prior non-agent code generation methods (e.g., direct LLM prompting or fine-tuned models) to demonstrate that these features are not already present or emergent in earlier work; without this grounding, the subsequent single/multi-agent categorization risks being an arbitrary overlay rather than a natural developmental trajectory.
- [§3] §3 (Architectures) and the literature selection description: The single-agent versus multi-agent taxonomy is presented as systematic, but the manuscript does not report search protocol, inclusion/exclusion criteria, database sources, or date range for the surveyed papers; this omission directly undermines the claim that the selected works represent core developments without major omissions, as hybrid or tool-centric systems that do not fit cleanly into the binary split could be under-represented.
minor comments (2)
- [Applications section] The abstract lists applications across the full SDLC but the corresponding section would benefit from a table summarizing which agent architectures are applied to which SDLC phases to improve readability.
- [Benchmarks section] Ensure that all cited benchmarks (e.g., HumanEval, MBPP extensions) include the exact metrics reported in the original papers rather than paraphrased summaries.
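To make the point about exact metrics concrete, here is a sketch of pass@k, the functional-correctness metric introduced alongside HumanEval. The review does not say which metrics the survey reports, so treating pass@k as the relevant one is an assumption. With n samples per problem, of which c pass the unit tests, the unbiased per-problem estimate is 1 - C(n-c, k) / C(n, k); the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the chance that at least one of k samples passes.

    n: samples generated for the problem
    c: samples among the n that pass all tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    # 1 minus the probability that a random k-subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples and 3 passing, pass@1 is approximately 0.3.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

A paraphrase such as "accuracy" hides the values of n and k, which is why two papers reporting the same benchmark can be incomparable unless the exact metric is quoted.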
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our survey. These observations help clarify the presentation of our core claims and improve the methodological transparency of the work. We address each major comment below and indicate the revisions we will make.
- Referee: [Abstract and §1] Abstract and §1 (Introduction): The central claim that the three features (autonomy, expanded SDLC scope, and engineering practicality) distinctly characterize LLM-based agents is load-bearing for the entire survey structure, yet the text provides no explicit contrast with prior non-agent code generation methods (e.g., direct LLM prompting or fine-tuned models) to demonstrate that these features are not already present or emergent in earlier work; without this grounding, the subsequent single/multi-agent categorization risks being an arbitrary overlay rather than a natural developmental trajectory.
  Authors: We agree that the distinction would benefit from more explicit grounding. The manuscript states that agents are 'distinct from previous code generation techniques' and enumerates the three features, but does not include a direct comparison. In the revision we will insert a short subsection (or expanded paragraph) in §1 that contrasts LLM-based agents with direct prompting and fine-tuned models, using concrete examples to show how autonomy over the full workflow, SDLC breadth, and engineering focus become central only in the agent setting. This addition will better motivate the subsequent taxonomy without altering the survey's scope.
  Revision: yes
- Referee: [§3] §3 (Architectures) and the literature selection description: The single-agent versus multi-agent taxonomy is presented as systematic, but the manuscript does not report search protocol, inclusion/exclusion criteria, database sources, or date range for the surveyed papers; this omission directly undermines the claim that the selected works represent core developments without major omissions, as hybrid or tool-centric systems that do not fit cleanly into the binary split could be under-represented.
  Authors: We accept that the current draft lacks a transparent literature-selection description. Although the taxonomy reflects the dominant architectural patterns we observed, we will add a dedicated 'Literature Review Methodology' subsection at the start of §3. It will specify the databases searched (arXiv, Google Scholar, IEEE Xplore, ACM DL), the keyword combinations and date range (primarily 2022–2024), inclusion criteria (papers that explicitly describe LLM-powered agents for code generation), and exclusion criteria, together with a brief note on how hybrid or tool-centric systems are classified within the single- versus multi-agent framework. This revision directly addresses the concern about potential under-representation.
  Revision: yes
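The screening step the authors promise could be written down as an explicit, reproducible filter. The sketch below is only one hypothetical reading of the stated criteria: the `Paper` record, the keyword tuples, and the hard 2022 to 2024 cutoff are assumptions (the rebuttal says "primarily"), and the sample records are made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Paper:
    title: str
    year: int
    abstract: str = ""


# Hypothetical keyword groups; the rebuttal does not list the actual terms.
LLM_TERMS = ("llm", "large language model")
AGENT_TERMS = ("agent",)
CODE_TERMS = ("code generation", "program synthesis", "software development")


def include(paper: Paper, first_year: int = 2022, last_year: int = 2024) -> bool:
    """Keep papers in the date range that mention LLMs, agents, and code tasks."""
    text = f"{paper.title} {paper.abstract}".lower()
    in_range = first_year <= paper.year <= last_year
    return (
        in_range
        and any(term in text for term in LLM_TERMS)
        and any(term in text for term in AGENT_TERMS)
        and any(term in text for term in CODE_TERMS)
    )


# Made-up candidate records, used only to exercise the filter.
candidates: List[Paper] = [
    Paper("A multi-agent LLM framework for code generation", 2024),
    Paper("A syntactic neural model for code generation", 2017),
    Paper("LLM agents for web browsing", 2023),
]
selected = [p.title for p in candidates if include(p)]
print(selected)
```

Publishing a filter like this next to the taxonomy would let readers check which hybrid or tool-centric systems were screened out, rather than having to trust that none were.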
Circularity Check
No circularity: survey organizes external literature without self-referential derivations
This paper is a literature review that references external prior work to categorize LLM-based code generation agents into single-agent and multi-agent architectures and to trace developmental trajectories. It contains no equations, no fitted parameters, no predictions derived from its own inputs, and no self-citation chains that bear the central claims. The three core features (autonomy, expanded SDLC scope, engineering practicality) are presented as characterizations drawn from the surveyed body of work rather than results forced by the paper's own definitions or citations. Completeness of coverage is an assumption of any survey but does not constitute circularity under the defined criteria, as no reduction of a claimed result to the paper's own inputs is exhibited.
Forward citations
Cited by 26 Pith papers
- PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
  PRISM benchmark of over 10k pairs shows LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.
- SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering
  SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.
- From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements
  TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.
- BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks
  BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.
- PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization
  PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.
- AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
  AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
- Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
  Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
- Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
  LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.
- Think Anywhere in Code Generation
  Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.
- Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation
  SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.
- CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
  CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reas...
- Context Training with Active Information Seeking
  Adding active search tools to LLM context optimization works only when combined with a multi-candidate search-based training procedure that prunes contexts, delivering gains across low-resource translation, health, an...
- AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development
  More capable LLMs and agents generate code with greater volume and architectural decay, following a Volume-Quality Inverse Law that neither functional correctness nor prompting mitigates.
- QuantClaw: Precision Where It Matters for OpenClaw
  QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
- Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
  LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.
- ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness
  ACE-Bench is an execution-free benchmark that scores LLM coding agents on correct Azure SDK usage via deterministic regex checks and reference-based LLM judges derived from official documentation.
- Code as Agent Harness
  A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...
- Context Training with Active Information Seeking
  Active information seeking via search tools, when combined with multi-candidate context pruning during training, produces consistent gains on translation, health, and reasoning tasks over naive tool addition or no-too...
- Nautilus: From One Prompt to Plug-and-Play Robot Learning
  NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
- TDD Governance for Multi-Agent Code Generation via Prompt Engineering
  An AI-native TDD framework operationalizes classical TDD principles as prompt-level and workflow-level governance mechanisms in a layered multi-agent architecture to improve stability and reproducibility of LLM code g...
- Agentic Insight Generation in VSM Simulations
  A two-step agentic system for extracting insights from VSM simulations achieves up to 86% accuracy with top LLMs by using progressive data discovery and slim context.
- Rethinking Agentic Reinforcement Learning In Large Language Models
  The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.
- Sustainable Code Generation Using Large Language Models: A Systematic Literature Review
  A systematic review finds research on the sustainability of LLM-generated code to be limited, fragmented, and without accepted frameworks for measurement or benchmarking.
- LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
  A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future res...
- Rethinking Agentic Reinforcement Learning In Large Language Models
  The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
- Rethinking Agentic Reinforcement Learning In Large Language Models
  This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.
Reference graph
Works this paper leans on
-
[1]
Inductive programming: A survey of program synthesis techniques,
E. Kitzelmann, “Inductive programming: A survey of program synthesis techniques,” inInternational Work- shop on Approaches and Applications of Inductive Pro- gramming (AAIP), 2009, pp. 50–73
work page 2009
-
[2]
Latent predictor networks for code generation,
W. Ling, E. Grefenstette, K. M. Hermann, T. Ko ˇcisk`y, A. Senior, F. Wang, and P . Blunsom, “Latent predictor networks for code generation,” inMeeting of the As- sociation for Computational Linguistics (ACL), 2016, pp. 599–609
work page 2016
-
[3]
A syntactic neural model for general-purpose code generation,
P . Yin and G. Neubig, “A syntactic neural model for general-purpose code generation,” inMeeting of the Association for Computational Linguistics (ACL), 2017, pp. 440–450
work page 2017
-
[4]
LLaMA: Open and efficient founda- tion language models,
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “LLaMA: Open and efficient founda- tion language models,” 2023
work page 2023
-
[5]
LLaMA 2: Open foundation and fine- tuned chat models,
H. Touvron, L. Martin, K. Stone, P . Albert, A. Alma- hairi, Y. Babaei, N. Bashlykov, S. Batra, P . Bhargava, S. Bhosaleet al., “LLaMA 2: Open foundation and fine- tuned chat models,” 2023. 19
work page 2023
-
[6]
Training language models to follow instructions with human feedback,
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wain- wright, P . Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,” inConference on Neural Information Processing Systems (NeurIPS), 2022, pp. 27 730–27 744
work page 2022
-
[7]
Code LLaMA: Open foundation models for code,
B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remezet al., “Code LLaMA: Open foundation models for code,” 2023
work page 2023
-
[8]
Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021, pp. 8696–8708
work page 2021
-
[9]
Competition-level code generation with Alphacode,
Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrit- twieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lagoet al., “Competition-level code generation with Alphacode,”Science, vol. 378, no. 6624, pp. 1092– 1097, 2022
work page 2022
-
[10]
Self-planning code generation with large language models,
X. Jiang, Y. Dong, L. Wang, Z. Fang, Q. Shang, G. Li, Z. Jin, and W. Jiao, “Self-planning code generation with large language models,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 7, pp. 1–30, 2024
work page 2024
-
[11]
H. Le, H. Chen, A. Saha, A. Gokul, D. Sahoo, and S. Joty, “CodeChain: Towards modular code genera- tion through chain of self-revisions with representa- tive sub-modules,” inInternational Conference on Learn- ing Representations (ICLR), 2023
work page 2023
-
[12]
CodeCoR: An llm- based self-reflective multi-agent framework for code generation,
R. Pan, H. Zhang, and C. Liu, “CodeCoR: An llm- based self-reflective multi-agent framework for code generation,” 2025
work page 2025
-
[13]
Codepori: Large scale model for autonomous software development by using multi- agents,
Z. Rasheed, M. Waseem, M. Saari, K. Syst ¨a, and P . Abrahamsson, “Codepori: Large scale model for autonomous software development by using multi- agents,” 2024
work page 2024
-
[14]
An autonomous multi-agent llm frame- work for agile software development,
S. Manish, “An autonomous multi-agent llm frame- work for agile software development,”International Journal of Trend in Scientific Research and Development, vol. 8, no. 5, pp. 892–898, 2024
work page 2024
-
[15]
Self-collaboration code generation via ChatGPT,
Y. Dong, X. Jiang, Z. Jin, and G. Li, “Self-collaboration code generation via ChatGPT,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 7, pp. 1–38, 2024
work page 2024
-
[16]
Chatdev: Communicative agents for software development,
C. Qian, W. Liu, H. Liu, N. Chen, Y. Dang, J. Li, C. Yang, W. Chen, Y. Su, X. Conget al., “Chatdev: Communicative agents for software development,” in Meeting of the Association for Computational Linguistics (ACL), 2023, pp. 15 174–15 186
work page 2023
-
[17]
Metagpt: Meta programming for multi-agent collaborative framework,
S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou et al., “Metagpt: Meta programming for multi-agent collaborative framework,” inInternational Conference on Learning Representations (ICLR), 2023
work page 2023
-
[18]
CodeXGLUE: A machine learning benchmark dataset for code understanding and generation,
S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. Clement, D. Drain, D. Jiang, D. Tang et al., “CodeXGLUE: A machine learning benchmark dataset for code understanding and generation,” in Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021
work page 2021
-
[19]
ClarifyGPT: Empowering LLM- based code generation with intention clarification,
F. Mu, L. Shi, S. Wang, Z. Yu, B. Zhang, C. Wang, S. Liu, and Q. Wang, “ClarifyGPT: Empowering LLM- based code generation with intention clarification,” 2023
work page 2023
-
[20]
Z. Wang, W. Wang, Z. Li, L. Wang, C. Yi, X. Xu, L. Cao, H. Su, S. Chen, and J. Zhou, “Xuat-copilot: Multi-agent collaborative system for automated user acceptance testing with large language model,” 2024
work page 2024
-
[21]
Logiagent: Auto- mated logical testing for rest systems with llm-based multi-agents,
K. Zhang, C. Zhang, C. Wang, C. Zhang, Y. Wu, Z. Xing, Y. Liu, Q. Li, and X. Peng, “Logiagent: Auto- mated logical testing for rest systems with llm-based multi-agents,” 2025
work page 2025
-
[22]
Ai-driven refactoring: A pipeline for identifying and correcting data clumps in git reposi- tories,
N. Baumgartner, P . Iyenghar, T. Schoemaker, and E. Pulverm¨uller, “Ai-driven refactoring: A pipeline for identifying and correcting data clumps in git reposi- tories,”Electronics, vol. 13, no. 9, p. 1644, 2024
work page 2024
-
[23]
Abstract syntax networks for code generation and semantic parsing,
M. Rabinovich, M. Stern, and D. Klein, “Abstract syntax networks for code generation and semantic parsing,” inMeeting of the Association for Computational Linguistics (ACL), 2017, pp. 1139–1149
work page 2017
-
[24]
GraphCodeBERT: Pre-training code representations with data flow,
D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fuet al., “GraphCodeBERT: Pre-training code representations with data flow,” inInternational Conference on Learning Representations, 2021
work page 2021
-
[25]
UniXcoder: Unified cross-modal pre-training for code representation,
D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, and J. Yin, “UniXcoder: Unified cross-modal pre-training for code representation,” pp. 7212–7225, 2022
work page 2022
-
[26]
Qualityflow: An agentic workflow for program synthesis controlled by llm quality checks,
Y. Hu, Q. Zhou, Q. Chen, X. Li, L. Liu, D. Zhang, A. Kachroo, T. Oz, and O. Tripp, “Qualityflow: An agentic workflow for program synthesis controlled by llm quality checks,” 2025
work page 2025
-
[27]
Y. Ishibashi and Y. Nishimura, “Self-organized agents: A LLM multi-agent framework toward ultra large- scale code generation and optimization,” 2024
work page 2024
-
[28]
K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin, “CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges,” inMeeting of the Association for Computational Linguis- tics (ACL), 2024
work page 2024
-
[29]
Toolgen: Unified tool retrieval and calling via generation,
R. Wang, X. Han, L. Ji, S. Wang, T. Baldwin, and H. Li, “Toolgen: Unified tool retrieval and calling via generation,” inInternational Conference on Learning Representations (ICLR), 2025
work page 2025
-
[30]
Vibe coding vs. agentic coding: Fundamentals and practical implications of agentic AI,
R. Sapkota, K. I. Roumeliotis, and M. Karkee, “Vibe coding vs. agentic coding: Fundamentals and practical implications of agentic AI,” 2025
work page 2025
-
[31]
Adaptive test generation using a large language model,
M. Sch ¨afer, S. Nadi, A. Eghbali, and F. Tip, “Adaptive test generation using a large language model,” 2023
work page 2023
-
[32]
A multi-agent llm-based juit test generation with strong oracles,
Q. Xu, G. Wang, L. Briand, and K. Liu, “A multi-agent llm-based juit test generation with strong oracles,” 2025
work page 2025
-
[33]
Leveraging llms to automate energy-aware refactor- ing of parallel scientific codes,
M. T. Dearing, Y. Tao, X. Wu, Z. Lan, and V . Taylor, “Leveraging llms to automate energy-aware refactor- ing of parallel scientific codes,” 2025
work page 2025
-
[34]
Sysllmatic: Large language models are software system optimizers,
H. Peng, A. Gupte, R. Hasler, N. J. Eliopoulos, C.- C. Ho, R. Mantri, L. Deng, K. L ¨aufer, G. K. Thiru- vathukal, and J. C. Davis, “Sysllmatic: Large language models are software system optimizers,” 2025
work page 2025
-
[35]
Harnessing large language models for seed generation in greybox 20 fuzzing,
W. Shi, Y. Zhang, X. Xing, and J. Xu, “Harnessing large language models for seed generation in greybox 20 fuzzing,” 2024
work page 2024
-
[36]
Mutation- guided llm-based test generation at meta,
C. Foster, A. Gulati, M. Harman, I. Harper, K. Mao, J. Ritchey, H. Robert, and S. Sengupta, “Mutation- guided llm-based test generation at meta,” 2025
work page 2025
-
[37]
H. Jin, L. Huang, H. Cai, J. Yan, B. Li, and H. Chen, “From LLMs to LLM-based agents for software engi- neering: A survey of current, challenges and future,” 2024
work page 2024
-
[38]
Large language model-based agents for software engineering: A survey,
J. Liu, K. Wang, Y. Chen, X. Peng, Z. Chen, L. Zhang, and Y. Lou, “Large language model-based agents for software engineering: A survey,” 2024
work page 2024
-
[39]
J. He, C. Treude, and D. Lo, “Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 5, pp. 1–30, 2025
work page 2025
-
[40]
Agents in software engineering: Survey, landscape, and vision,
Y. Wang, W. Zhong, Y. Huang, E. Shi, M. Yang, J. Chen, H. Li, Y. Ma, Q. Wang, and Z. Zheng, “Agents in software engineering: Survey, landscape, and vision,” 2024
work page 2024
-
[41]
Intellicode compose: Code generation using trans- former,
A. Svyatkovskiy, S. K. Deng, S. Fu, and N. Sundaresan, “Intellicode compose: Code generation using trans- former,” inACM Joint Meeting on European Software Engineering Conference and Symposium on the Founda- tions of Software Engineering, 2020, pp. 1433–1443
work page 2020
-
[42]
Herrington,Code generation in action
J. Herrington,Code generation in action. Manning Publications Co., 2003
work page 2003
-
[43]
B. A. Becker, P . Denny, J. Finnie-Ansley, A. Luxton- Reilly, J. Prather, and E. A. Santos, “Programming is hard-or at least it used to be: Educational opportuni- ties and challenges of AI code generation,” inACM Technical Symposium on Computer Science Education V . 1 (SIGCSE), 2023, pp. 500–506
work page 2023
-
[44]
In-IDE code generation from natural language: Promise and chal- lenges,
F. F. Xu, B. Vasilescu, and G. Neubig, “In-IDE code generation from natural language: Promise and chal- lenges,”ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 2, pp. 1–47, 2022
work page 2022
-
[45]
N. Huynh and B. Lin, “Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications,” 2025
work page 2025
-
[46]
ChatGPT for good? on opportunities and challenges of large lan- guage models for education,
E. Kasneci, K. Seßler, S. K ¨uchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. G ¨unnemann, E. H ¨ullermeieret al., “ChatGPT for good? on opportunities and challenges of large lan- guage models for education,”Learning and Individual Differences, vol. 103, p. 102274, 2023
work page 2023
-
[47]
A survey on evaluation of large language models,
Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wanget al., “A survey on evaluation of large language models,”ACM transac- tions on intelligent systems and technology, vol. 15, no. 3, pp. 1–45, 2024
work page 2024
-
[48]
A survey of large language models,
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Donget al., “A survey of large language models,” 2023
work page 2023
-
[49]
A comprehensive overview of large language models,
H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,” ACM Transactions on Intelligent Systems and Technology, 2023
work page 2023
-
[50]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” inConference on Neural Information Processing Systems (NeurIPS), 2017
work page 2017
-
[51]
Evaluating large language models trained on code,
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P . D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockmanet al., “Evaluating large language models trained on code,” 2021
work page 2021
-
[52]
Deepseek-coder: When the large language model meets programming– the rise of code intelligence,
D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Liet al., “Deepseek-coder: When the large language model meets programming– the rise of code intelligence,” 2024
work page 2024
-
[53]
Qwen2. 5-coder technical report,
B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Luet al., “Qwen2. 5-coder technical report,” 2024
work page 2024
-
[54]
Chain-of-thought prompting elicits reasoning in large language mod- els,
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language mod- els,” inConference on Neural Information Processing Systems (NeurIPS), 2022, pp. 24 824–24 837
-
[55]
Planning in natural language improves LLM search for code generation,
E. Wang, F. Cassano, C. Wu, Y. Bai, W. Song, V. Nath, Z. Han, S. Hendryx, S. Yue, and H. Zhang, “Planning in natural language improves LLM search for code generation,” in International Conference on Learning Representations (ICLR), 2025
-
[56]
WebGPT: Browser-assisted question-answering with human feedback,
R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess, and J. Schulman, “WebGPT: Browser-assisted question-answering with human feedback,” 2022
-
[57]
Toolformer: Language models can teach themselves to use tools,
T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” in Conference on Neural Information Processing Systems (NeurIPS), 2023, pp. 68 539–68 551
-
[58]
Pal: Program-aided language models,
L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, “Pal: Program-aided language models,” in International Conference on Machine Learning (ICML), 2023
-
[59]
Generative agents: Interactive simulacra of human behavior,
J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” in Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023, pp. 1–22
-
[60]
N. Mehta, M. Teruel, P. F. Sanz, X. Deng, A. H. Awadallah, and J. Kiseleva, “Improving grounded language understanding in a collaborative environment by interacting with agents through help feedback,” arXiv preprint arXiv:2304.10750, 2023
-
[61]
The rise and potential of large language model based agents: A survey,
Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou et al., “The rise and potential of large language model based agents: A survey,” Science China Information Sciences, vol. 68, no. 2, p. 121101, 2025
-
[62]
A survey on large language model based autonomous agents,
L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin et al., “A survey on large language model based autonomous agents,” Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024
-
[63]
Large language model based multi-agents: A survey of progress and challenges,
T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,” 2024
-
[64]
Exploring large language model based intelligent agents: Definitions, methods, and prospects,
Y. Cheng, C. Zhang, Z. Zhang, X. Meng, S. Hong, W. Li, Z. Wang, Z. Wang, F. Yin, J. Zhao et al., “Exploring large language model based intelligent agents: Definitions, methods, and prospects,” 2024
-
[65]
Prompt engineering with ChatGPT: a guide for academic writers,
L. Giray, “Prompt engineering with ChatGPT: a guide for academic writers,” Annals of Biomedical Engineering, vol. 51, no. 12, pp. 2629–2633, 2023
-
[66]
A prompt pattern catalog to enhance prompt engineering with ChatGPT,
J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C. Schmidt, “A prompt pattern catalog to enhance prompt engineering with ChatGPT,” 2023
-
[67]
Retrieval-augmented generation for large language models: A survey,
Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, H. Wang, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” 2023
-
[68]
Retrieval-augmented generation for knowledge-intensive NLP tasks,
P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Conference on Neural Information Processing Systems (NeurIPS), 2020, pp. 9459–9474
-
[69]
Large language model-aware in-context learning for code generation,
J. Li, C. Tao, J. Li, G. Li, Z. Jin, H. Zhang, Z. Fang, and F. Liu, “Large language model-aware in-context learning for code generation,” ACM Transactions on Software Engineering and Methodology, 2023
-
[70]
Larger language models do in-context learning differently,
J. Wei, J. Wei, Y. Tay, D. Tran, A. Webson, Y. Lu, X. Chen, H. Liu, D. Huang, D. Zhou et al., “Larger language models do in-context learning differently,” 2023
-
[71]
AgentCoder: Multi-agent-based code generation with iterative testing and optimisation,
D. Huang, J. M. Zhang, M. Luck, Q. Bu, Y. Qing, and H. Cui, “AgentCoder: Multi-agent-based code generation with iterative testing and optimisation,” 2023
-
[72]
HyperAgent: Generalist software engineering agents to solve coding tasks at scale,
H. N. Phan, T. N. Nguyen, P. X. Nguyen, and N. D. Bui, “HyperAgent: Generalist software engineering agents to solve coding tasks at scale,” 2024
-
[73]
ToolCoder: Teach code generation models to use API search tools,
K. Zhang, H. Zhang, G. Li, J. Li, Z. Li, and Z. Jin, “ToolCoder: Teach code generation models to use API search tools,” 2023
-
[74]
Repohyper: Better context retrieval is all you need for repository-level code completion,
H. N. Phan, H. N. Phan, T. N. Nguyen, and N. D. Bui, “Repohyper: Better context retrieval is all you need for repository-level code completion,” 2024
-
[75]
Self-refine: Iterative refinement with self-feedback,
A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang et al., “Self-refine: Iterative refinement with self-feedback,” in Conference on Neural Information Processing Systems (NeurIPS), 2023, pp. 46 534–46 594
-
[76]
Self-Edit: Fault-aware code editor for code generation,
K. Zhang, Z. Li, J. Li, G. Li, and Z. Jin, “Self-Edit: Fault-aware code editor for code generation,” in Meeting of the Association for Computational Linguistics (ACL), 2023, pp. 769–787
-
[77]
Executable code actions elicit better LLM agents,
X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji, “Executable code actions elicit better LLM agents,” in International Conference on Machine Learning (ICML), 2024
-
[78]
Knowledge-aware code generation with large language models,
T. Huang, Z. Sun, Z. Jin, G. Li, and C. Lyu, “Knowledge-aware code generation with large language models,” in IEEE/ACM International Conference on Program Comprehension (ICPC), 2024, pp. 52–63
-
[79]
A real-world webagent with planning, long context understanding, and program synthesis,
I. Gur, H. Furuta, A. Huang, M. Safdari, Y. Matsuo, D. Eck, and A. Faust, “A real-world webagent with planning, long context understanding, and program synthesis,” in International Conference on Learning Representations (ICLR), 2024
-
[80]
Codeplan: Repository-level coding using LLMs and planning,
R. Bairi, A. Sonwane, A. Kanade, V. D. C, A. Iyer, S. Parthasarathy, S. Rajamani, B. Ashok, and S. Shet, “Codeplan: Repository-level coding using LLMs and planning,” ACM on Software Engineering, vol. 1, no. FSE, pp. 675–698, 2024