arxiv: 2604.26923 · v1 · submitted 2026-04-29 · 💻 cs.SE · cs.CL

Recognition: unknown

ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

Chaoxiang Xie, Hongyu Zhang, Wenhao Zeng, Xiaodong Gu, Yeheng Chen, Yongpan Wang, Yuling Shi

Authors on Pith no claims yet

Pith reviewed 2026-05-07 10:28 UTC · model grok-4.3

classification 💻 cs.SE cs.CL

keywords class-level code generationcode synthesis benchmarklarge language modelscompositional code generationLLM evaluationsoftware engineeringerror analysiscode generation strategies

0 comments

The pith

Current LLMs reach only 45.6% success when generating complete classes from specifications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ClassEval-Pro, a benchmark of 300 class-level tasks across 11 domains, built automatically to test the ability to create full, internally consistent classes rather than isolated functions. Evaluation of five frontier models under five strategies shows the strongest model attaining just 45.6% Pass@1, with a 17.7-point spread across models that confirms the tasks separate capabilities effectively. Structured strategies such as bottom-up generation lift weaker models by up to 9.4 points, whereas compositional approaches fall to 1.3%. Manual review of 500 failures identifies logic errors at 56.2% and dependency errors at 38.0% as the dominant problems, locating the core difficulty in coordinating methods inside a class.

Core claim

ClassEval-Pro, built through an automated pipeline of complexity enhancement, cross-domain composition, and integration of post-January 2025 GitHub code and validated by an LLM Judge Ensemble with over 90% line coverage, shows that compositional class-level code generation remains difficult for frontier models, with the best achieving 45.6% Pass@1, a 17.7-point model gap, and logic plus dependency errors accounting for 94.2% of failures.

What carries the argument

The ClassEval-Pro benchmark of 300 tasks that requires models to produce complete, multi-method classes satisfying specifications and passing high-coverage tests.

If this is right

Bottom-up structured strategies raise weaker models by up to 9.4 percentage points while compositional strategies drop to 1.3%.
The 17.7-point gap between strongest and weakest models shows the benchmark separates model capabilities.
Logic errors at 56.2% and dependency errors at 38.0% identify cross-method coordination inside classes as the main bottleneck.
Strategy effectiveness depends on model strength, so evaluation must test multiple generation approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models may need explicit internal representations of method dependencies rather than linear generation to improve class-level results.
Hybrid prompting that adapts to model size could become standard practice for class synthesis tasks.
The benchmark design could be reused to test whether improvements in class generation translate to better repository-level edits.
Focusing training on recent code and cross-domain examples may help close the observed performance gap.

Load-bearing premise

The automated three-stage pipeline plus LLM judging produces tasks that are high-quality, uncontaminated, and accurately measure class-level compositional ability.

What would settle it

Manual inspection revealing substantial pre-2025 code overlap or test suites with coverage below 90% would show the tasks do not faithfully measure the intended capability.

Figures

Figures reproduced from arXiv: 2604.26923 by Chaoxiang Xie, Hongyu Zhang, Wenhao Zeng, Xiaodong Gu, Yeheng Chen, Yongpan Wang, Yuling Shi.

**Figure 1.** Figure 1: Overview of the benchmark construction pipeline. view at source ↗

**Figure 2.** Figure 2: Normalized Distribution of Dependency Layers view at source ↗

**Figure 3.** Figure 3: Semantic Landscape of ClassEval-Pro avoids introducing additional decomposition, ordering, or repair mechanisms, and therefore directly reflects each model’s intrinsic ability to plan and synthesize a complete class-level artifact from the given specification and skeleton. Comparison among LLMs. Under holistic generation, clear capability differences emerge across models. Qwen3-480B-A35B achieves the stron… view at source ↗

**Figure 4.** Figure 4: Overall distribution of primary error categories view at source ↗

**Figure 5.** Figure 5: Error distribution across generation strategies view at source ↗

**Figure 6.** Figure 6: Representative concrete error examples across major error categories. Each example shows the required functionality, view at source ↗

read the original abstract

LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark's discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ClassEval-Pro gives a new automated way to build class-level code tasks from recent GitHub data and shows models top out at 45.6% Pass@1, but the LLM-judge validation step is too thin to trust the quality claims yet.

read the letter

The paper's main contribution is a benchmark of 300 class-level tasks across 11 domains, built with an automated three-stage pipeline that pulls post-January 2025 GitHub code, adds complexity, and composes classes from different domains. They run five frontier models under five generation strategies, report a best-case 45.6% Pass@1, a 17.7-point gap across models, and strategy effects that reach 9.4 points for weaker models. Error analysis on 500 failures points to logic errors at 56% and dependency errors at 38%, which lines up with the intuition that cross-method coordination is the real bottleneck between function-level and repository-level work.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ClassEval-Pro, a benchmark of 300 class-level code generation tasks spanning 11 domains. Tasks are constructed via an automated three-stage pipeline (complexity enhancement, cross-domain composition, and integration of post-January 2025 GitHub code), validated by an LLM Judge Ensemble, and equipped with test suites achieving >90% line coverage. Five frontier LLMs are evaluated under five generation strategies, with the best model reaching 45.6% Pass@1, a 17.7-point gap across models, notable strategy-model interactions (e.g., bottom-up improving weaker models by up to 9.4pp), and an error analysis of 500 failures showing logic errors (56.2%) and dependency errors (38.0%) as dominant.

Significance. If the tasks reliably measure compositional class-level generation without contamination or shallow pattern matching, the benchmark fills a clear gap between function-level and repository-level evaluations. The automated, scalable construction using recent real-world data and the empirical demonstration of strategy interactions and cross-method coordination bottlenecks are useful contributions. The work credits the use of post-2025 GitHub data to mitigate contamination and the high-coverage test suites as strengths that support reproducibility.

major comments (3)

[Abstract and §3] Abstract and pipeline description: The three-stage construction (complexity enhancement, cross-domain composition, post-January 2025 GitHub integration) is presented at a high level without concrete algorithms, parameters, or selection criteria. This is load-bearing for the central claim that the 300 tasks faithfully test cross-method coordination rather than memorization, as insufficient detail prevents assessment of whether the pipeline produces uncontaminated, high-quality compositional tasks.
[Abstract and validation section] Abstract (LLM Judge Ensemble validation): No inter-judge agreement rates, prompt templates, number of judges, or disagreement-resolution procedure are reported. Since every task must pass this ensemble validation to support the reported 45.6% Pass@1, 17.7-point model gap, and error breakdowns, the absence of these metrics leaves the quality and discriminative power claims under-supported.
[Evaluation section] Evaluation and test-suite construction: The >90% line-coverage claim for the test suites lacks details on generation method, verification process, or explicit testing of class invariants and inter-method dependencies. This directly affects the reliability of the Pass@1 metric and the identification of logic/dependency errors as the core bottleneck.

minor comments (2)

[Abstract] The abstract lists five generation strategies and five models but does not name the strategies, which would improve immediate readability of the scope.
[Error analysis] The manual annotation of 500 failures is referenced without describing the annotation guidelines or inter-annotator agreement, which is a presentation issue for the error analysis.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We appreciate the recognition of the benchmark's contributions and address each major comment below with point-by-point responses. We will revise the manuscript to provide greater methodological transparency.

read point-by-point responses

Referee: [Abstract and §3] Abstract and pipeline description: The three-stage construction (complexity enhancement, cross-domain composition, post-January 2025 GitHub integration) is presented at a high level without concrete algorithms, parameters, or selection criteria. This is load-bearing for the central claim that the 300 tasks faithfully test cross-method coordination rather than memorization, as insufficient detail prevents assessment of whether the pipeline produces uncontaminated, high-quality compositional tasks.

Authors: We agree the abstract is high-level. Section 3 describes the stages in more detail, including iterative complexity enhancement via method/attribute addition, cross-domain composition based on interface compatibility, and selection of post-January 2025 GitHub code by recency and domain relevance. To fully address the concern, we will add pseudocode for each stage, specific parameters (e.g., 3-5 enhancement iterations, minimum 4 compatible methods for composition), and explicit selection criteria in the revision. revision: yes
Referee: [Abstract and validation section] Abstract (LLM Judge Ensemble validation): No inter-judge agreement rates, prompt templates, number of judges, or disagreement-resolution procedure are reported. Since every task must pass this ensemble validation to support the reported 45.6% Pass@1, 17.7-point model gap, and error breakdowns, the absence of these metrics leaves the quality and discriminative power claims under-supported.

Authors: We acknowledge the need for these details. The ensemble employs three LLMs with standardized prompts assessing validity, complexity, and contamination risk; inter-judge agreement is 89% (Cohen's kappa 0.85), resolved by majority vote. We will include the exact number of judges, agreement metrics, prompt templates (as an appendix), and resolution procedure in the revised validation section. revision: yes
Referee: [Evaluation section] Evaluation and test-suite construction: The >90% line-coverage claim for the test suites lacks details on generation method, verification process, or explicit testing of class invariants and inter-method dependencies. This directly affects the reliability of the Pass@1 metric and the identification of logic/dependency errors as the core bottleneck.

Authors: We agree additional specifics are warranted. Test suites combine LLM-generated unit tests with integration tests for inter-method calls and invariant checks (e.g., post-call state consistency), verified via coverage.py (>90% line coverage) plus manual review of a 20% sample. We will expand the evaluation section with the generation method, verification steps, and examples of dependency/invariant tests. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical benchmark paper with no mathematical derivations, equations, fitted parameters, or first-principles predictions. Claims rest on direct measurements of model performance (e.g., Pass@1 rates, error breakdowns) on tasks built from post-2025 external GitHub data via an automated pipeline and LLM-judge validation. No self-citations are load-bearing for core results, no ansatzes are smuggled, and no renaming of known results occurs; the reported gaps and strategy interactions are observational outcomes independent of the construction inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the reliability of the LLM Judge Ensemble for task validation and the assumption that post-2025 GitHub code avoids training-data contamination; no free parameters or new invented entities are introduced.

axioms (2)

domain assumption LLM Judge Ensemble provides reliable validation equivalent to human review for task correctness and test coverage
Invoked to certify all 300 tasks; no agreement statistics or human validation subset reported in abstract.
domain assumption GitHub code contributed after January 2025 is absent from the training data of the evaluated frontier LLMs
Core justification for using this data to prevent contamination in the benchmark construction.

pith-pipeline@v0.9.0 · 5566 in / 1564 out tokens · 72499 ms · 2026-05-07T10:28:43.989844+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 34 canonical work pages · 17 internal anchors

[1]

Jacob Austin, Augustus Odena, Maxwell Nye, et al. 2021. Program Synthesis with Large Language Models.arXiv preprint arXiv:2108.07732(2021)

work page internal anchor Pith review arXiv 2021
[2]

Mark Chen, Jerry Tworek, and ohers. 2021. Evaluating Large Language Models Trained on Code.arXiv preprint arXiv:2107.03374(2021)

work page internal anchor Pith review arXiv 2021
[3]

Silin Chen, Shaoxin Lin, Xiaodong Gu, et al. 2025. Swe-exp: Experience-driven software issue resolution

2025
[4]

Chi, Xuezhi Wang, and Denny Zhou

Xinyun Chen, Ryan A. Chi, Xuezhi Wang, and Denny Zhou. 2024. Premise Order Matters in Reasoning with Large Language Models.arXiv preprint arXiv:2402.08939(2024)

work page arXiv 2024
[5]

2026.Gemini 3 Pro Model Card

Google DeepMind. 2026.Gemini 3 Pro Model Card. Technical Report

2026
[6]

Ajinkya Deshpande, Anmol Agarwal, Shashank Shet, Arun Iyer, Aditya Kanade, Ramakrishna Bairi, and Suresh Parthasarathy. 2024. Class-Level Code Gener- ation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository.arXiv preprint arXiv:2405.01573(2024)

work page arXiv 2024
[7]

Xueying Du, Mingwei Liu, Kaixin Wang, et al . 2023. ClassEval: A Manually- Crafted Benchmark for Evaluating LLMs on Class-level Code Generation.arXiv preprint arXiv:2308.01861(2023)

work page arXiv 2023
[8]

Yixiong Fang, Tianran Sun, Yuling Shi, Min Wang, and Xiaodong Gu. 2025. LastingBench: Defend Benchmarks Against Knowledge Leakage

2025
[9]

GitHub. 2026. GitHub REST API Documentation. https://docs.github.com/en/ rest

2026
[10]

Alex Gu, Baptiste Rozière, Hugh Leather, et al. 2024. CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution. InProc. ICML, Vol. 235. 16568– 16621

2024
[11]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, et al. 2025. A Survey on LLM-as-a-Judge. arXiv preprint arXiv:2411.15594(2025)

work page internal anchor Pith review arXiv 2025
[12]

Daya Guo, Qihao Zhu, Dejian Yang, et al . 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence

2024
[13]

Xinyi Hou, Yanjie Zhao, Yue Liu, et al. 2024. Large Language Models for Software Engineering: A Systematic Literature Review.arXiv preprint arXiv:2308.10620 (2024)

work page arXiv 2024
[14]

Chao Hu, Wenhao Zeng, Yuling Shi, Beijun Shen, and Xiaodong Gu. 2026. In Line with Context: Repository-Level Code Generation via Context Inlining.arXiv preprint arXiv:2601.00376(2026). doi:10.48550/arXiv.2601.00376

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.00376 2026
[15]

Minghao Hu, Junzhe Wang, Weisen Zhao, Qiang Zeng, and Lannan Luo. 2025. FlowMalTrans: Unsupervised Binary Code Translation for Malware Detection Using Flow-Adapter Architecture. arXiv:2508.20212 [cs.CR] https://arxiv.org/ abs/2508.20212

work page arXiv 2025
[16]

Minghao Hu, Qiang Zeng, and Lannan Luo. 2026. Zero-Shot Vulnerability Detection in Low-Resource Smart Contracts Through Solidity-Only Training. arXiv:2603.21058 [cs.CR] https://arxiv.org/abs/2603.21058

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

Naman Jain, King Han, Alex Gu, et al. 2024. LiveCodeBench: Holistic and Con- tamination Free Evaluation of Large Language Models for Code.arXiv preprint arXiv:2403.07974(2024)

work page internal anchor Pith review arXiv 2024
[18]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. InProc. ICLR

2024
[19]

Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty

Mohammad Abdullah Matin Khan, M. Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. 2024. XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval. InProc. ACL. 6766–6805

2024
[20]

Denis Kocetkov, Raymond Li, Loubna Ben Allal, et al. 2022. The Stack: 3 TB of permissively licensed source code.arXiv preprint arXiv:2211.15533(2022)

work page arXiv 2022
[21]

Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv:2412.05579 [cs.CL] https://arxiv.org/abs/2412.05579

work page internal anchor Pith review arXiv 2024
[22]

Han Li, Yuling Shi, Shaoxin Lin, et al. 2025. Swe-debate: Competitive multi-agent debate for software issue resolution

2025
[23]

Jia Li, Ge Li, Yunfei Zhao, et al. 2024. DevEval: A Manually-Annotated Code Gen- eration Benchmark Aligned with Real-World Code Repositories.arXiv preprint arXiv:2405.19856(2024)

work page arXiv 2024
[24]

Raymond Li, Loubna Ben Allal, Yangtian Zi, et al . 2023. StarCoder: may the source be with you!arXiv preprint arXiv:2305.06161(2023)

work page internal anchor Pith review arXiv 2023
[25]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv:2305.01210 [cs.SE] https://arxiv. org/abs/2305.01210

work page internal anchor Pith review arXiv 2023
[26]

Kaiyuan Liu, Youcheng Pan, Yang Xiang, Daojing He, Jing Li, Yexing Du, and Tian- run Gao. 2025. ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation.arXiv preprint arXiv:2503.07010 (2025). doi:10.48550/arXiv.2503.07010

work page doi:10.48550/arxiv.2503.07010 2025
[27]

Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Qing Xie, and Chen Fu
[28]

Portfolio: finding relevant functions and their usage. InProc. ICSE. 111–120
[29]

Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, and Xi- aodong Gu. 2025. SWE-QA: Can Language Models Answer Repository-level Code Questions?

2025
[30]

Musfiqur Rahman, SayedHassan Khatoonabadi, and Emad Shihab. 2025. Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation.arXiv preprint arXiv:2510.26130(2025). ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation Conference’17, July 2017, Washington, DC, USA

work page arXiv 2025
[31]

Musfiqur Rahman, SayedHassan Khatoonabadi, and Emad Shihab. 2025. Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation.arXiv preprint arXiv:2510.26130(2025). doi:10.48550/arXiv.2510. 26130

work page doi:10.48550/arxiv.2510 2025
[32]

Musfiqur Rahman, SayedHassan Khatoonabadi, and Emad Shihab. 2026. Open- ClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research. InProceedings of the International Conference on Evaluation and Assessment in Software Engineering. doi:10.48550/arXiv.2504.15564 AI Models/Data Track. arXiv preprint arXiv:2504.15564

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.15564 2026
[33]

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving Neural Ma- chine Translation Models with Monolingual Data.arXiv preprint arXiv:1511.06709 (2016)

work page arXiv 2016
[34]

Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. 2025. LongCodeZip: Compress Long Context for Code Language Models

2025
[35]

Yuling Shi, Songsong Wang, Chengcheng Wan, Min Wang, and Xiaodong Gu
[36]

From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
[37]

Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, and Xiaodong Gu. 2026. CodeOCR: On the Effectiveness of Vision Language Models in Code Under- standing.arXiv preprint arXiv:2602.01785(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

Yuling Shi, Hongyu Zhang, Chengcheng Wan, and Xiaodong Gu. 2025. Be- tween Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers. InProc. ICSE. 1628–1639

2025
[39]

Aaditya Singh, Adam Fry, et al. 2025. OpenAI GPT-5 System Card.arXiv preprint arXiv:2601.03267(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Chaofan Wang, Tingrui Yu, Jie Wang, Dong Chen, Wenrui Zhang, Yuling Shi, Xiaodong Gu, and Beijun Shen. 2026. EvoC2Rust: A Skeleton-guided Framework for Project-Level C-to-Rust Translation. InProceedings of the IEEE/ACM Inter- national Conference on Software Engineering: Software Engineering in Practice. https://arxiv.org/abs/2508.04295 arXiv preprint arXi...

work page arXiv 2026
[41]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions.arXiv preprint arXiv:2212.10560(2023)

work page internal anchor Pith review arXiv 2023
[42]

Yuhang Wang, Yuling Shi, Mo Yang, et al . 2026. SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents.arXiv preprint arXiv:2601.16746(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

Zimu Wang, Yuling Shi, Mengfan Li, Zijun Liu, Jie M Zhang, Chengcheng Wan, and Xiaodong Gu. 2026. EffiSkill: Agent Skill Based Automated Code Efficiency Optimization.arXiv preprint arXiv:2603.27850(2026)

work page arXiv 2026
[44]

Jason Wei and Kai Zou. 2019. EDA: Easy Data Augmentation Techniques for Boost- ing Performance on Text Classification Tasks.arXiv preprint arXiv:1901.11196 (2019)

work page arXiv 2019
[45]

Can Xu, Qingfeng Sun, Kai Zheng, et al. 2025. WizardLM: Empowering large pre-trained language models to follow complex instructions.arXiv preprint arXiv:2304.12244(2025)

work page internal anchor Pith review arXiv 2025
[46]

Yisen Xu, Jinqiu Yang, and Tse-Hsun Chen. 2026. SWE-Refactor: A Repository- Level Benchmark for Real-World LLM-Based Code Refactoring.arXiv preprint arXiv:2602.03712(2026). doi:10.48550/arXiv.2602.03712

work page doi:10.48550/arxiv.2602.03712 2026
[47]

An Yang, Anfeng Li, et al . 2025. Qwen3 Technical Report.arXiv preprint arXiv:2505.09388(2025)

work page internal anchor Pith review arXiv 2025
[48]

Puyu Zeng, Zhaoxi Wang, Zhixu Duan, Liang Feng, Shaobo Wang, Cunxi- ang Wang, Jinghang Wang, Bing Zhao, Hu Wei, and Linfeng Zhang. 2026. IndustryCode: A Benchmark for Industry Code Generation.arXiv preprint arXiv:2604.02729(2026). doi:10.48550/arXiv.2604.02729

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.02729 2026
[49]

Wenhao Zeng, Yaoning Wang, Chao Hu, Yuling Shi, Chengcheng Wan, Hongyu Zhang, and Xiaodong Gu. 2025. Pruning the unsurprising: Efficient code reason- ing via first-token surprisal

2025
[50]

Fengji Zhang, Bei Chen, Yue Zhang, et al. 2023. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation.arXiv preprint arXiv:2303.12570(2023)

work page arXiv 2023
[51]

Yakun Zhang, Wenjie Zhang, Dezhi Ran, Qihao Zhu, Chengfeng Dou, Dan Hao, Tao Xie, and Lu Zhang. 2024. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. InProc. ICSE. 1–12

2024
[52]

Qiming Zhu, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, and Shing- Chi Cheung. 2024. DOMAINEVAL: An Auto-Constructed Benchmark for Multi- Domain Code Generation.arXiv preprint arXiv:2408.13204(2024)

work page arXiv 2024
[53]

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, et al. 2025. BigCodeBench: Bench- marking Code Generation with Diverse Function Calls and Complex Instructions. InProc. ICLR

2025