hub

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang · 2023 · cs.SE · arXiv 2305.01210

22 Pith papers cite this work. Polarity classification is still indexing.

22 Pith papers citing it

open full Pith review browse 22 citing papers arXiv PDF

abstract

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM- and mutation-based strategies. While EvalPlus is general, we extend the test-cases of the popular HumanEval benchmark by 80x to build HumanEval+. Our extensive evaluation across 26 popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HumanEval+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by up-to 19.3-28.9%. We also surprisingly found that test insufficiency can lead to mis-ranking. For example, both WizardCoder-CodeLlama and Phind-CodeLlama now outperform ChatGPT on HumanEval+, while none of them could on HumanEval. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction to improve such programming benchmarks through automated testing. We have open-sourced our tools, enhanced datasets as well as all LLM-generated code at https://github.com/evalplus/evalplus to facilitate and accelerate future LLM-for-code research.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

cs.CL · 2023-10-10 · unverdicted · novelty 8.0

SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.

ProgramBench: Can Language Models Rebuild Programs From Scratch?

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while favoring monolithic code.

ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.

When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation

cs.SE · 2026-04-27 · unverdicted · novelty 7.0

Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

cs.SE · 2024-05-06 · unverdicted · novelty 7.0

SWE-agent introduces a custom agent-computer interface that lets LM agents solve software engineering tasks, reaching 12.5% pass@1 on SWE-bench and 87.7% on HumanEvalFix, exceeding prior non-interactive approaches.

DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.

Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation

cs.SE · 2026-05-09 · unverdicted · novelty 6.0

Semantic distance on program execution behaviors improves uncertainty estimation for LLM code generation and outperforms prior sample-based methods across benchmarks and models.

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a competitive 23A2B model from Qwen3-Next-80A3B.

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

cs.SE · 2026-05-06 · accept · novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

cs.SE · 2026-04-27 · conditional · novelty 6.0

SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specification is the most damaging defect type while richer benchmarks are more resilient.

You Don't Need Public Tests to Generate Correct Code

cs.SE · 2026-04-23 · unverdicted · novelty 6.0

DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or external signals.

Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation

cs.SE · 2026-04-20 · unverdicted · novelty 6.0

Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.

Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code

cs.SE · 2026-04-13 · unverdicted · novelty 6.0

Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.

Leveraging Mathematical Reasoning of LLMs for Efficient GPU Thread Mapping

cs.DC · 2026-04-12 · unverdicted · novelty 6.0

Large language models derive exact analytical GPU thread mappings for complex 2D/3D domains and fractals via in-context learning, outperforming symbolic regression and enabling up to thousands-fold speedups and energy reductions.

Ensemble-Based Uncertainty Estimation for Code Correctness Estimation

cs.SE · 2026-03-28 · unverdicted · novelty 6.0

Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

Textbooks Are All You Need

cs.CL · 2023-06-20 · unverdicted · novelty 6.0

A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

cs.AI · 2023-03-31 · conditional · novelty 6.0

CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.

Evaluating LLM-Generated Code: A Benchmark and Developer Study

cs.SE · 2026-05-09 · unverdicted · novelty 5.0

A custom three-fold methodology combining a complex-project correctness benchmark, code quality verification, and structured developer reviews to evaluate LLM-generated code beyond correctness alone.

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

cs.CL · 2025-03-03 · unverdicted · novelty 5.0

Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

StarCoder: may the source be with you!

cs.CL · 2023-05-09 · accept · novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

citing papers explorer

Showing 22 of 22 citing papers.

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? cs.CL · 2023-10-10 · unverdicted · none · ref 113 · internal anchor
SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.
ProgramBench: Can Language Models Rebuild Programs From Scratch? cs.SE · 2026-05-05 · unverdicted · none · ref 10 · internal anchor
ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while favoring monolithic code.
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation cs.SE · 2026-04-29 · unverdicted · none · ref 25 · internal anchor
ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation cs.SE · 2026-04-27 · unverdicted · none · ref 23 · internal anchor
Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering cs.SE · 2024-05-06 · unverdicted · none · ref 26 · internal anchor
SWE-agent introduces a custom agent-computer interface that lets LM agents solve software engineering tasks, reaching 12.5% pass@1 on SWE-bench and 87.7% on HumanEvalFix, exceeding prior non-interactive approaches.
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation cs.LG · 2026-05-09 · unverdicted · none · ref 31 · internal anchor
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation cs.SE · 2026-05-09 · unverdicted · none · ref 19 · internal anchor
Semantic distance on program execution behaviors improves uncertainty estimation for LLM code generation and outperforms prior sample-based methods across benchmarks and models.
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training cs.LG · 2026-05-09 · unverdicted · none · ref 44 · internal anchor
Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a competitive 23A2B model from Qwen3-Next-80A3B.
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code cs.SE · 2026-05-06 · accept · none · ref 71 · internal anchor
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis cs.SE · 2026-04-27 · conditional · none · ref 22 · internal anchor
SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specification is the most damaging defect type while richer benchmarks are more resilient.
You Don't Need Public Tests to Generate Correct Code cs.SE · 2026-04-23 · unverdicted · none · ref 14 · internal anchor
DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or external signals.
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts cs.LG · 2026-04-20 · unverdicted · none · ref 26 · internal anchor
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation cs.SE · 2026-04-20 · unverdicted · none · ref 18 · internal anchor
Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.
Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code cs.SE · 2026-04-13 · unverdicted · none · ref 19 · internal anchor
Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.
Leveraging Mathematical Reasoning of LLMs for Efficient GPU Thread Mapping cs.DC · 2026-04-12 · unverdicted · none · ref 30 · internal anchor
Large language models derive exact analytical GPU thread mappings for complex 2D/3D domains and fractals via in-context learning, outperforming symbolic regression and enabling up to thousands-fold speedups and energy reductions.
Ensemble-Based Uncertainty Estimation for Code Correctness Estimation cs.SE · 2026-03-28 · unverdicted · none · ref 27 · internal anchor
Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 277 · internal anchor
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Textbooks Are All You Need cs.CL · 2023-06-20 · unverdicted · none · ref 20 · internal anchor
A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society cs.AI · 2023-03-31 · conditional · none · ref 69 · internal anchor
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
Evaluating LLM-Generated Code: A Benchmark and Developer Study cs.SE · 2026-05-09 · unverdicted · none · ref 9 · internal anchor
A custom three-fold methodology combining a complex-project correctness benchmark, code quality verification, and structured developer reviews to evaluate LLM-generated code beyond correctness alone.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs cs.CL · 2025-03-03 · unverdicted · none · ref 31 · internal anchor
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
StarCoder: may the source be with you! cs.CL · 2023-05-09 · accept · none · ref 289 · internal anchor
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer