ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
arXiv:2405.19856 [cs.CL] https://arxiv.org/abs/2405.19856
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.
citing papers explorer
-
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
-
Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.