Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.
hub
Multi-lingual evaluation of code generation models
17 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
polarities
background 4representative citing papers
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
OpenClassGen supplies 324,843 real-world Python classes with self-contained skeletons and static metrics to support LLM class generation research and evaluation.
Multi-SWE-bench provides 1,632 high-quality issue-resolving instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++ for evaluating LLMs on codebase modifications.
New benchmark for LLM performance on incremental AWS CDK infrastructure code edits shows top models succeed in 34% of cases.
CodegenBench shows LLMs generate optimized code well for x86_64 but exhibit significant performance degradation on Sunway and Kunpeng due to limited documentation and training data.
Introduces a three-fold benchmark for LLM-generated code combining correctness testing on a complex project, quality verification, and developer surveys to assess production readiness.
MEMCoder boosts LLM code generation for private libraries by 16.31% pass@1 via a multi-dimensional evolving memory that distills usage guidelines from execution feedback and combines them with static docs.
The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.
Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
MUTGEN incorporates mutation feedback into LLM prompts and uses iteration to generate unit tests that achieve higher mutation scores than EvoSuite or vanilla LLM prompting on 204 benchmark subjects.
Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future research directions with 18 subcategories.
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
citing papers explorer
-
Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs
Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.
-
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
-
OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research
OpenClassGen supplies 324,843 real-world Python classes with self-contained skeletons and static metrics to support LLM class generation research and evaluation.
-
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Multi-SWE-bench provides 1,632 high-quality issue-resolving instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++ for evaluating LLMs on codebase modifications.
-
SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code
New benchmark for LLM performance on incremental AWS CDK infrastructure code edits shows top models succeed in 34% of cases.
-
CodegenBench: Can LLMs Write Efficient Code Across Architectures?
CodegenBench shows LLMs generate optimized code well for x86_64 but exhibit significant performance degradation on Sunway and Kunpeng due to limited documentation and training data.
-
Evaluating LLM-Generated Code: A Benchmark and Developer Study
Introduces a three-fold benchmark for LLM-generated code combining correctness testing on a complex project, quality verification, and developer surveys to assess production readiness.
-
MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation
MEMCoder boosts LLM code generation for private libraries by 16.31% pass@1 via a multi-dimensional evolving memory that distills usage guidelines from execution feedback and combines them with static docs.
-
A Taxonomy of Programming Languages for Code Generation
The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.
-
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
-
Mutation-Guided Unit Test Generation with a Large Language Model
MUTGEN incorporates mutation feedback into LLM prompts and uses iteration to generate unit tests that achieve higher mutation scores than EvoSuite or vanilla LLM prompting on 204 benchmark subjects.
-
A Study of LLMs' Preferences for Libraries and Programming Languages
Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future research directions with 18 subcategories.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
- RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices