DevEval: A manually-annotated code generation benchmark aligned with real-world code repositories

2024 · arXiv 2405.19856

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.

SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?

cs.SE · 2026-05-25 · unverdicted · novelty 6.0

SetupX presents an experiential learning framework for LLM agents that reaches 92% pass rate on functionality-correct repository setup by transferring verified fixes across repositories via XPU representations, LIFO Docker snapshots, and Prosecutor-Judge verification.

AdaDec: A Uncertainty-Guided Lookahead Decoding Framework for LLM-Based Code Generation

cs.SE · 2025-06-10 · unverdicted · novelty 5.0

AdaDec improves Pass@1 accuracy of LLM code generation by up to 20.9% over greedy decoding by triggering lookahead reranking only at high-uncertainty steps on HumanEval+, MBPP+, and DevEval.

citing papers explorer

ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

cs.SE · 2026-04-29 · unverdicted · none · ref 23

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

cs.CL · 2026-04-03 · unverdicted · none · ref 40

SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?

cs.SE · 2026-05-25 · unverdicted · none · ref 4

AdaDec: A Uncertainty-Guided Lookahead Decoding Framework for LLM-Based Code Generation

cs.SE · 2025-06-10 · unverdicted · none · ref 40
