Recognition: unknown
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
Pith reviewed 2026-05-07 10:28 UTC · model grok-4.3
The pith
Current LLMs reach only 45.6% success when generating complete classes from specifications.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ClassEval-Pro, built through an automated pipeline of complexity enhancement, cross-domain composition, and integration of post-January 2025 GitHub code and validated by an LLM Judge Ensemble with over 90% line coverage, shows that compositional class-level code generation remains difficult for frontier models, with the best achieving 45.6% Pass@1, a 17.7-point model gap, and logic plus dependency errors accounting for 94.2% of failures.
What carries the argument
The ClassEval-Pro benchmark of 300 tasks that requires models to produce complete, multi-method classes satisfying specifications and passing high-coverage tests.
If this is right
- Bottom-up structured strategies raise weaker models by up to 9.4 percentage points while compositional strategies drop to 1.3%.
- The 17.7-point gap between strongest and weakest models shows the benchmark separates model capabilities.
- Logic errors at 56.2% and dependency errors at 38.0% identify cross-method coordination inside classes as the main bottleneck.
- Strategy effectiveness depends on model strength, so evaluation must test multiple generation approaches.
Where Pith is reading between the lines
- Models may need explicit internal representations of method dependencies rather than linear generation to improve class-level results.
- Hybrid prompting that adapts to model size could become standard practice for class synthesis tasks.
- The benchmark design could be reused to test whether improvements in class generation translate to better repository-level edits.
- Focusing training on recent code and cross-domain examples may help close the observed performance gap.
Load-bearing premise
The automated three-stage pipeline plus LLM judging produces tasks that are high-quality, uncontaminated, and accurately measure class-level compositional ability.
What would settle it
Manual inspection revealing substantial pre-2025 code overlap or test suites with coverage below 90% would show the tasks do not faithfully measure the intended capability.
Figures
read the original abstract
LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark's discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ClassEval-Pro, a benchmark of 300 class-level code generation tasks spanning 11 domains. Tasks are constructed via an automated three-stage pipeline (complexity enhancement, cross-domain composition, and integration of post-January 2025 GitHub code), validated by an LLM Judge Ensemble, and equipped with test suites achieving >90% line coverage. Five frontier LLMs are evaluated under five generation strategies, with the best model reaching 45.6% Pass@1, a 17.7-point gap across models, notable strategy-model interactions (e.g., bottom-up improving weaker models by up to 9.4pp), and an error analysis of 500 failures showing logic errors (56.2%) and dependency errors (38.0%) as dominant.
Significance. If the tasks reliably measure compositional class-level generation without contamination or shallow pattern matching, the benchmark fills a clear gap between function-level and repository-level evaluations. The automated, scalable construction using recent real-world data and the empirical demonstration of strategy interactions and cross-method coordination bottlenecks are useful contributions. The work credits the use of post-2025 GitHub data to mitigate contamination and the high-coverage test suites as strengths that support reproducibility.
major comments (3)
- [Abstract and §3] Abstract and pipeline description: The three-stage construction (complexity enhancement, cross-domain composition, post-January 2025 GitHub integration) is presented at a high level without concrete algorithms, parameters, or selection criteria. This is load-bearing for the central claim that the 300 tasks faithfully test cross-method coordination rather than memorization, as insufficient detail prevents assessment of whether the pipeline produces uncontaminated, high-quality compositional tasks.
- [Abstract and validation section] Abstract (LLM Judge Ensemble validation): No inter-judge agreement rates, prompt templates, number of judges, or disagreement-resolution procedure are reported. Since every task must pass this ensemble validation to support the reported 45.6% Pass@1, 17.7-point model gap, and error breakdowns, the absence of these metrics leaves the quality and discriminative power claims under-supported.
- [Evaluation section] Evaluation and test-suite construction: The >90% line-coverage claim for the test suites lacks details on generation method, verification process, or explicit testing of class invariants and inter-method dependencies. This directly affects the reliability of the Pass@1 metric and the identification of logic/dependency errors as the core bottleneck.
minor comments (2)
- [Abstract] The abstract lists five generation strategies and five models but does not name the strategies, which would improve immediate readability of the scope.
- [Error analysis] The manual annotation of 500 failures is referenced without describing the annotation guidelines or inter-annotator agreement, which is a presentation issue for the error analysis.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We appreciate the recognition of the benchmark's contributions and address each major comment below with point-by-point responses. We will revise the manuscript to provide greater methodological transparency.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and pipeline description: The three-stage construction (complexity enhancement, cross-domain composition, post-January 2025 GitHub integration) is presented at a high level without concrete algorithms, parameters, or selection criteria. This is load-bearing for the central claim that the 300 tasks faithfully test cross-method coordination rather than memorization, as insufficient detail prevents assessment of whether the pipeline produces uncontaminated, high-quality compositional tasks.
Authors: We agree the abstract is high-level. Section 3 describes the stages in more detail, including iterative complexity enhancement via method/attribute addition, cross-domain composition based on interface compatibility, and selection of post-January 2025 GitHub code by recency and domain relevance. To fully address the concern, we will add pseudocode for each stage, specific parameters (e.g., 3-5 enhancement iterations, minimum 4 compatible methods for composition), and explicit selection criteria in the revision. revision: yes
-
Referee: [Abstract and validation section] Abstract (LLM Judge Ensemble validation): No inter-judge agreement rates, prompt templates, number of judges, or disagreement-resolution procedure are reported. Since every task must pass this ensemble validation to support the reported 45.6% Pass@1, 17.7-point model gap, and error breakdowns, the absence of these metrics leaves the quality and discriminative power claims under-supported.
Authors: We acknowledge the need for these details. The ensemble employs three LLMs with standardized prompts assessing validity, complexity, and contamination risk; inter-judge agreement is 89% (Cohen's kappa 0.85), resolved by majority vote. We will include the exact number of judges, agreement metrics, prompt templates (as an appendix), and resolution procedure in the revised validation section. revision: yes
-
Referee: [Evaluation section] Evaluation and test-suite construction: The >90% line-coverage claim for the test suites lacks details on generation method, verification process, or explicit testing of class invariants and inter-method dependencies. This directly affects the reliability of the Pass@1 metric and the identification of logic/dependency errors as the core bottleneck.
Authors: We agree additional specifics are warranted. Test suites combine LLM-generated unit tests with integration tests for inter-method calls and invariant checks (e.g., post-call state consistency), verified via coverage.py (>90% line coverage) plus manual review of a 20% sample. We will expand the evaluation section with the generation method, verification steps, and examples of dependency/invariant tests. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper presents an empirical benchmark paper with no mathematical derivations, equations, fitted parameters, or first-principles predictions. Claims rest on direct measurements of model performance (e.g., Pass@1 rates, error breakdowns) on tasks built from post-2025 external GitHub data via an automated pipeline and LLM-judge validation. No self-citations are load-bearing for core results, no ansatzes are smuggled, and no renaming of known results occurs; the reported gaps and strategy interactions are observational outcomes independent of the construction inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption LLM Judge Ensemble provides reliable validation equivalent to human review for task correctness and test coverage
- domain assumption GitHub code contributed after January 2025 is absent from the training data of the evaluated frontier LLMs
Reference graph
Works this paper leans on
-
[1]
Jacob Austin, Augustus Odena, Maxwell Nye, et al. 2021. Program Synthesis with Large Language Models.arXiv preprint arXiv:2108.07732(2021)
work page internal anchor Pith review arXiv 2021
-
[2]
Mark Chen, Jerry Tworek, and ohers. 2021. Evaluating Large Language Models Trained on Code.arXiv preprint arXiv:2107.03374(2021)
work page internal anchor Pith review arXiv 2021
-
[3]
Silin Chen, Shaoxin Lin, Xiaodong Gu, et al. 2025. Swe-exp: Experience-driven software issue resolution
2025
-
[4]
Chi, Xuezhi Wang, and Denny Zhou
Xinyun Chen, Ryan A. Chi, Xuezhi Wang, and Denny Zhou. 2024. Premise Order Matters in Reasoning with Large Language Models.arXiv preprint arXiv:2402.08939(2024)
-
[5]
2026.Gemini 3 Pro Model Card
Google DeepMind. 2026.Gemini 3 Pro Model Card. Technical Report
2026
- [6]
- [7]
-
[8]
Yixiong Fang, Tianran Sun, Yuling Shi, Min Wang, and Xiaodong Gu. 2025. LastingBench: Defend Benchmarks Against Knowledge Leakage
2025
-
[9]
GitHub. 2026. GitHub REST API Documentation. https://docs.github.com/en/ rest
2026
-
[10]
Alex Gu, Baptiste Rozière, Hugh Leather, et al. 2024. CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution. InProc. ICML, Vol. 235. 16568– 16621
2024
-
[11]
Jiawei Gu, Xuhui Jiang, Zhichao Shi, et al. 2025. A Survey on LLM-as-a-Judge. arXiv preprint arXiv:2411.15594(2025)
work page internal anchor Pith review arXiv 2025
-
[12]
Daya Guo, Qihao Zhu, Dejian Yang, et al . 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence
2024
- [13]
-
[14]
Chao Hu, Wenhao Zeng, Yuling Shi, Beijun Shen, and Xiaodong Gu. 2026. In Line with Context: Repository-Level Code Generation via Context Inlining.arXiv preprint arXiv:2601.00376(2026). doi:10.48550/arXiv.2601.00376
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.00376 2026
- [15]
-
[16]
Minghao Hu, Qiang Zeng, and Lannan Luo. 2026. Zero-Shot Vulnerability Detection in Low-Resource Smart Contracts Through Solidity-Only Training. arXiv:2603.21058 [cs.CR] https://arxiv.org/abs/2603.21058
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[17]
Naman Jain, King Han, Alex Gu, et al. 2024. LiveCodeBench: Holistic and Con- tamination Free Evaluation of Large Language Models for Code.arXiv preprint arXiv:2403.07974(2024)
work page internal anchor Pith review arXiv 2024
-
[18]
Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. InProc. ICLR
2024
-
[19]
Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty
Mohammad Abdullah Matin Khan, M. Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. 2024. XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval. InProc. ACL. 6766–6805
2024
- [20]
-
[21]
Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv:2412.05579 [cs.CL] https://arxiv.org/abs/2412.05579
work page internal anchor Pith review arXiv 2024
-
[22]
Han Li, Yuling Shi, Shaoxin Lin, et al. 2025. Swe-debate: Competitive multi-agent debate for software issue resolution
2025
- [23]
-
[24]
Raymond Li, Loubna Ben Allal, Yangtian Zi, et al . 2023. StarCoder: may the source be with you!arXiv preprint arXiv:2305.06161(2023)
work page internal anchor Pith review arXiv 2023
-
[25]
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv:2305.01210 [cs.SE] https://arxiv. org/abs/2305.01210
work page internal anchor Pith review arXiv 2023
-
[26]
Kaiyuan Liu, Youcheng Pan, Yang Xiang, Daojing He, Jing Li, Yexing Du, and Tian- run Gao. 2025. ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation.arXiv preprint arXiv:2503.07010 (2025). doi:10.48550/arXiv.2503.07010
-
[27]
Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Qing Xie, and Chen Fu
-
[28]
Portfolio: finding relevant functions and their usage. InProc. ICSE. 111–120
-
[29]
Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, and Xi- aodong Gu. 2025. SWE-QA: Can Language Models Answer Repository-level Code Questions?
2025
-
[30]
Musfiqur Rahman, SayedHassan Khatoonabadi, and Emad Shihab. 2025. Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation.arXiv preprint arXiv:2510.26130(2025). ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation Conference’17, July 2017, Washington, DC, USA
-
[31]
Musfiqur Rahman, SayedHassan Khatoonabadi, and Emad Shihab. 2025. Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation.arXiv preprint arXiv:2510.26130(2025). doi:10.48550/arXiv.2510. 26130
-
[32]
Musfiqur Rahman, SayedHassan Khatoonabadi, and Emad Shihab. 2026. Open- ClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research. InProceedings of the International Conference on Evaluation and Assessment in Software Engineering. doi:10.48550/arXiv.2504.15564 AI Models/Data Track. arXiv preprint arXiv:2504.15564
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.15564 2026
- [33]
-
[34]
Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. 2025. LongCodeZip: Compress Long Context for Code Language Models
2025
-
[35]
Yuling Shi, Songsong Wang, Chengcheng Wan, Min Wang, and Xiaodong Gu
-
[36]
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
-
[37]
Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, and Xiaodong Gu. 2026. CodeOCR: On the Effectiveness of Vision Language Models in Code Under- standing.arXiv preprint arXiv:2602.01785(2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[38]
Yuling Shi, Hongyu Zhang, Chengcheng Wan, and Xiaodong Gu. 2025. Be- tween Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers. InProc. ICSE. 1628–1639
2025
-
[39]
Aaditya Singh, Adam Fry, et al. 2025. OpenAI GPT-5 System Card.arXiv preprint arXiv:2601.03267(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Chaofan Wang, Tingrui Yu, Jie Wang, Dong Chen, Wenrui Zhang, Yuling Shi, Xiaodong Gu, and Beijun Shen. 2026. EvoC2Rust: A Skeleton-guided Framework for Project-Level C-to-Rust Translation. InProceedings of the IEEE/ACM Inter- national Conference on Software Engineering: Software Engineering in Practice. https://arxiv.org/abs/2508.04295 arXiv preprint arXi...
-
[41]
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions.arXiv preprint arXiv:2212.10560(2023)
work page internal anchor Pith review arXiv 2023
-
[42]
Yuhang Wang, Yuling Shi, Mo Yang, et al . 2026. SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents.arXiv preprint arXiv:2601.16746(2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
- [43]
- [44]
-
[45]
Can Xu, Qingfeng Sun, Kai Zheng, et al. 2025. WizardLM: Empowering large pre-trained language models to follow complex instructions.arXiv preprint arXiv:2304.12244(2025)
work page internal anchor Pith review arXiv 2025
-
[46]
Yisen Xu, Jinqiu Yang, and Tse-Hsun Chen. 2026. SWE-Refactor: A Repository- Level Benchmark for Real-World LLM-Based Code Refactoring.arXiv preprint arXiv:2602.03712(2026). doi:10.48550/arXiv.2602.03712
-
[47]
An Yang, Anfeng Li, et al . 2025. Qwen3 Technical Report.arXiv preprint arXiv:2505.09388(2025)
work page internal anchor Pith review arXiv 2025
-
[48]
Puyu Zeng, Zhaoxi Wang, Zhixu Duan, Liang Feng, Shaobo Wang, Cunxi- ang Wang, Jinghang Wang, Bing Zhao, Hu Wei, and Linfeng Zhang. 2026. IndustryCode: A Benchmark for Industry Code Generation.arXiv preprint arXiv:2604.02729(2026). doi:10.48550/arXiv.2604.02729
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.02729 2026
-
[49]
Wenhao Zeng, Yaoning Wang, Chao Hu, Yuling Shi, Chengcheng Wan, Hongyu Zhang, and Xiaodong Gu. 2025. Pruning the unsurprising: Efficient code reason- ing via first-token surprisal
2025
- [50]
-
[51]
Yakun Zhang, Wenjie Zhang, Dezhi Ran, Qihao Zhu, Chengfeng Dou, Dan Hao, Tao Xie, and Lu Zhang. 2024. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. InProc. ICSE. 1–12
2024
- [52]
-
[53]
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, et al. 2025. BigCodeBench: Bench- marking Code Generation with Diverse Function Calls and Complex Instructions. InProc. ICLR
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.