Multi-lingual evaluation of code generation models
9 Pith papers cite this work.
citing papers explorer
- Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
  Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
- MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation
  MEMCoder boosts LLM code generation for private libraries by 16.31% pass@1 via a multi-dimensional evolving memory that distills usage guidelines from execution feedback and combines them with static docs.
- RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
  RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs. It shows that LLMs struggle more when generating full repositories, produce modules with errors, and perform best with whole-repo generation on small projects but with module-by-module generation on complex ones.
- A Taxonomy of Programming Languages for Code Generation
  The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
  LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
- Evaluating LLM-Generated Code: A Benchmark and Developer Study
  The paper presents a custom three-fold methodology that combines a complex-project correctness benchmark, code quality verification, and structured developer reviews to evaluate LLM-generated code beyond correctness alone.
- StarCoder: may the source be with you!
  StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
- A Survey on Large Language Models for Code Generation
  A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
- OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research