Multi-lingual evaluation of code generation models

Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al · 2022 · arXiv 2210.14868

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

read on arXiv browse 9 citing papers

representative citing papers

Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation

cs.SE · 2026-04-23 · conditional · novelty 7.0

Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.

MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation

cs.SE · 2026-04-27 · unverdicted · novelty 6.0

MEMCoder boosts LLM code generation for private libraries by 16.31% pass@1 via a multi-dimensional evolving memory that distills usage guidelines from execution feedback and combines them with static docs.

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

cs.SE · 2026-04-24 · unverdicted · novelty 6.0

RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo generation on small projects versus module-by-module on complex ones.

A Taxonomy of Programming Languages for Code Generation

cs.CL · 2026-03-31 · accept · novelty 6.0

The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

Evaluating LLM-Generated Code: A Benchmark and Developer Study

cs.SE · 2026-05-09 · unverdicted · novelty 5.0

A custom three-fold methodology combining a complex-project correctness benchmark, code quality verification, and structured developer reviews to evaluate LLM-generated code beyond correctness alone.

StarCoder: may the source be with you!

cs.CL · 2023-05-09 · accept · novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

A Survey on Large Language Models for Code Generation

cs.CL · 2024-06-01 · unverdicted · novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.

OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research

cs.SE · 2025-04-22

citing papers explorer

Showing 9 of 9 citing papers.

Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation cs.SE · 2026-04-23 · conditional · none · ref 3
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation cs.SE · 2026-04-27 · unverdicted · none · ref 40
MEMCoder boosts LLM code generation for private libraries by 16.31% pass@1 via a multi-dimensional evolving memory that distills usage guidelines from execution feedback and combines them with static docs.
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices cs.SE · 2026-04-24 · unverdicted · none · ref 6
RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo generation on small projects versus module-by-module on complex ones.
A Taxonomy of Programming Languages for Code Generation cs.CL · 2026-03-31 · accept · none · ref 1
The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 242
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Evaluating LLM-Generated Code: A Benchmark and Developer Study cs.SE · 2026-05-09 · unverdicted · none · ref 1
A custom three-fold methodology combining a complex-project correctness benchmark, code quality verification, and structured developer reviews to evaluate LLM-generated code beyond correctness alone.
StarCoder: may the source be with you! cs.CL · 2023-05-09 · accept · none · ref 183
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
A Survey on Large Language Models for Code Generation cs.CL · 2024-06-01 · unverdicted · none · ref 16
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research cs.SE · 2025-04-22 · unreviewed · ref 9

Multi-lingual evaluation of code generation models

fields

years

verdicts

representative citing papers

citing papers explorer