pith. machine review for the scientific record.

arxiv: 2604.22659 · v1 · submitted 2026-04-24 · 💻 cs.SE

Recognition: unknown

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 11:21 UTC · model grok-4.3

classification 💻 cs.SE
keywords repo-level code generation · large language models · UML diagrams · code generation benchmark · software development practices · LLM evaluation · repository generation
0 comments

The pith

RealBench pairs natural language requirements with UML diagrams to test LLMs on generating full code repositories the way industry teams receive specs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RealBench, a benchmark for repository-level code generation that supplies each task with both natural language requirements and UML diagrams representing the system design. This format is intended to match how developers actually receive specifications in enterprise and team settings, unlike earlier benchmarks that use only raw text descriptions. Evaluation of advanced LLMs on RealBench shows sharply lower performance overall, with large differences between models. The models can locate and produce the modules indicated in the diagrams, yet the generated code is frequently undermined by grammar and logic errors. Generating the whole repository in a single pass works best on smaller projects, while breaking the task into modules improves results on more complex ones.

Core claim

RealBench is a repository-level code generation benchmark in which each example supplies natural language requirements together with UML diagrams as the system design. Testing reveals that LLMs achieve markedly lower performance on these tasks than on simpler benchmarks, accompanied by substantial gaps across different models. The models readily identify and instantiate the modules defined in the UML diagrams but produce code that often contains grammar and logic errors. Generating the entire repository in a single pass proves the strongest strategy for smaller repositories, whereas a module-by-module strategy yields better results for larger and more complex repositories.
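
To make the two generation strategies concrete, here is a minimal sketch of how an evaluation harness might drive them. The paper does not publish this interface; `generate` stands in for any LLM call, and the module list would come from the task's UML design, so every name below is illustrative rather than RealBench's actual API.

```python
from typing import Callable, Dict, List

# Stand-in for any LLM call: takes a prompt, returns generated text.
Generate = Callable[[str], str]

def whole_repo_strategy(requirements: str, uml: str, generate: Generate) -> str:
    """Ask the model for the entire repository in a single pass."""
    prompt = (
        "Implement the full repository described below.\n\n"
        f"Requirements:\n{requirements}\n\nUML design:\n{uml}\n"
    )
    return generate(prompt)

def module_by_module_strategy(
    requirements: str, uml: str, modules: List[str], generate: Generate
) -> Dict[str, str]:
    """Ask the model for one module at a time, showing it what already exists."""
    repo: Dict[str, str] = {}
    for module in modules:
        context = "\n\n".join(f"# {name}\n{code}" for name, code in repo.items())
        prompt = (
            f"Requirements:\n{requirements}\n\nUML design:\n{uml}\n\n"
            f"Modules generated so far:\n{context or '(none yet)'}\n\n"
            f"Now implement only the module '{module}'."
        )
        repo[module] = generate(prompt)
    return repo
```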

What carries the argument

RealBench benchmark, in which each task pairs natural language requirements with UML diagrams to represent system design and thereby simulate real-world specification delivery.
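
For intuition, one RealBench-style example might bundle its inputs and evaluation target roughly as follows. This schema is an editorial guess at the shape implied by the paper (natural language requirements plus UML as system design, judged against a reference repository), not the released data format; all field names and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RepoGenTask:
    """Hypothetical shape of one repo-level example: NL requirements plus UML design."""
    task_id: str
    requirements: str                 # natural language specification
    uml_diagrams: List[str]           # e.g. PlantUML sources for class/sequence diagrams
    language: str                     # implementation language of the target repository
    reference_repo: Dict[str, str] = field(default_factory=dict)  # file path -> source
    test_command: str = "pytest"      # how a generated repository would be exercised

example = RepoGenTask(
    task_id="realbench-demo-001",
    requirements="A command-line todo manager with persistent storage.",
    uml_diagrams=["@startuml\nclass TodoItem\nclass TodoStore\nTodoStore o-- TodoItem\n@enduml"],
    language="python",
)
```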

If this is right

  • LLMs exhibit substantially lower performance and wide gaps between models on repository-scale code generation.
  • LLMs reliably detect modules from UML diagrams but produce code containing frequent grammar and logic errors (one plausible way to score that module detection is sketched just after this list).
  • Generating the full repository in one step is the best approach for smaller repositories.
  • Generating module by module is the better strategy for complex repositories.
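
One plausible way to score the module-detection claim above: compare the classes named in the UML design against those present in the generated repository. The definitions below are an editorial guess at what "module deficiency" and "module redundancy" (Figure 5) could mean, not the paper's exact formulas.

```python
from typing import Set, Tuple

def deficiency_and_redundancy(designed: Set[str], generated: Set[str]) -> Tuple[float, float]:
    """Deficiency: share of designed modules missing from the generated repository.
    Redundancy: share of generated modules that the design never asked for."""
    deficiency = len(designed - generated) / len(designed) if designed else 0.0
    redundancy = len(generated - designed) / len(generated) if generated else 0.0
    return deficiency, redundancy

# Example: the design names three classes; the model produced two of them plus an extra one.
print(deficiency_and_redundancy({"TodoItem", "TodoStore", "Cli"},
                                {"TodoItem", "TodoStore", "Helpers"}))
# -> (0.333..., 0.333...)
```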

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams could adopt RealBench scores to choose which LLM to deploy for projects of different sizes.
  • LLM training focused on structured inputs such as UML could narrow the observed quality gap.
  • Similar benchmarks that incorporate other design documents beyond UML could test whether the size-dependent strategy pattern holds more broadly.

Load-bearing premise

The examples built from natural language requirements plus UML diagrams accurately reflect how developers typically receive specifications in enterprise applications and team development.

What would settle it

A direct test would measure whether higher RealBench scores for a given LLM predict larger productivity gains when that LLM assists developers on live enterprise projects; if no correlation appears, the alignment claim is falsified.
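
Concretely, such a study could be summarized with a rank correlation across models: pair each model's RealBench score with a productivity gain measured in the field and test whether the ranking transfers. The numbers below are placeholders, not data from the paper; spearmanr comes from SciPy.

```python
from scipy.stats import spearmanr

# Placeholder numbers, one entry per model; neither column comes from the paper.
realbench_scores = [12.4, 27.9, 18.3, 31.5, 9.7]      # benchmark score per model
productivity_gains = [3.1, 8.0, 5.2, 7.4, 2.0]         # measured % speed-up with that model

rho, p_value = spearmanr(realbench_scores, productivity_gains)
print(f"Spearman rho={rho:.2f}, p={p_value:.3f}")
# A rho near zero across enough models would undercut the alignment claim;
# a strong positive rho would support using RealBench scores for deployment decisions.
```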

Figures

Figures reproduced from arXiv: 2604.22659 by Ge Li, Hongyi Deng, Jia Li, Kechi Zhang, Tiankuo Zhao, Tianqi Shao, Weinan Wang, Yang Liu, Yihong Dong, Yingtao Fang, Yiran Zhang, Zhi Jin.

Figure 1: An Example for a Code Generation Task in RealBench.
Figure 2: The Construction Procedure of RealBench.
Figure 3: A straightforward setting asks LLMs to generate the entire repository all at once with the […]
Figure 4: Evaluation design for assessing the generated repository. It contains repository-level and class-level […]
Figure 5: Module Deficiency and Module Redundancy.
Figure 9: Error Type Distribution of Generated Real […]
read the original abstract

Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and EvoCodeBench have been created to evaluate LLMs by requiring them to generate code from natural language requirements. However, in enterprise applications and team development, developers typically write code based on structured designs or specifications rather than raw natural language descriptions. This gap between existing benchmarks and real industry development practices means that current benchmark scores may not accurately reflect how much code generation can help automate software development tasks. To address this gap, we propose RealBench, a repository-level code generation benchmark aligned with real-world industry software development practices. Each example includes both natural language requirements and UML diagrams as system design, matching how developers typically receive specifications. Based on the constructed benchmarks, we conduct a systematic evaluation of advanced LLMs' code generation capabilities when provided with structured system designs. The experimental results reveal key insights in current LLMs' capabilities for repo-level code generation aligned with real-world software development practices. First, we notice that regarding repo-level code generation, LLMs show much worse performance and there are significant performance gaps among LLMs. Second, LLMs are good at finding and creating modules defined in UML diagrams, but the quality of generated modules is often poor due to grammar and logic errors. Third, generating the entire repository at once is the best generation strategy on smaller repositories, while generating a complex repository with the module-by-module strategy works better compared to other strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RealBench, a repository-level code generation benchmark that augments natural language requirements with UML diagrams to align more closely with how developers receive specifications in industry settings. It evaluates several advanced LLMs on generating code for entire repositories or individual modules, reporting that LLMs perform poorly overall with large gaps between models, that they excel at identifying UML-defined modules but generate low-quality code with errors, and that whole-repo generation is preferable for small repositories while a module-by-module strategy works better for complex ones.

Significance. Should the benchmark's construction methodology prove representative of real enterprise specification practices, these findings would offer valuable guidance on the current limitations of LLMs for practical software automation and suggest context-dependent prompting strategies. The work builds on prior benchmarks by emphasizing structured designs, potentially improving the predictive power of evaluations for real-world impact.

major comments (2)
  1. [Abstract and Benchmark Construction section] The central motivation and interpretation of results rest on the claim that each RealBench example 'includes both natural language requirements and UML diagrams as system design, matching how developers typically receive specifications' (abstract). However, no section describes the sourcing or authoring of the UML diagrams, provides expert validation, developer surveys, or comparisons to real artifacts such as Jira tickets, Confluence pages, or UML present in open-source repositories. Without this, the external validity of the reported performance gaps, module quality issues, and strategy preferences cannot be established.
  2. [Experimental Results section] The experimental results section summarizes high-level findings (e.g., 'much worse performance', 'significant performance gaps', 'poor quality due to grammar and logic errors') but the abstract and available description omit key details including dataset size, number of examples per repository, exact LLMs and metrics used, error rate breakdowns, or statistical tests. This makes it difficult to assess the strength of the three directional claims.
minor comments (2)
  1. [Results discussion] Clarify the precise definition of 'grammar and logic errors' in generated modules and how they were measured (e.g., via compilation checks, test suites, or manual review); a rough sketch of one such measurement appears after this list.
  2. [Evaluation] Add a table or figure summarizing per-model performance metrics to support the 'significant gaps' claim.
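
As a concrete illustration of the measurement question in minor comment 1, one rough protocol for Python output would treat parse failures as grammar errors and test failures on parseable code as logic errors. This is an editorial sketch, not the paper's evaluation procedure.

```python
import ast
import subprocess
from typing import Optional, Sequence

def classify_module(source: str, test_cmd: Optional[Sequence[str]] = None) -> str:
    """Label one generated Python module: 'grammar' if it fails to parse,
    'logic' if it parses but its tests fail, otherwise 'ok'."""
    try:
        ast.parse(source)
    except SyntaxError:
        return "grammar"
    if test_cmd is not None:
        result = subprocess.run(list(test_cmd), capture_output=True)
        if result.returncode != 0:
            return "logic"
    return "ok"

# Example: a syntactically broken snippet is tagged as a grammar error.
print(classify_module("def f(:\n    return 1"))  # -> 'grammar'
```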

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important areas for improving the manuscript's clarity on benchmark construction and experimental reporting. We address each major comment below and will incorporate revisions to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract and Benchmark Construction section] The central motivation and interpretation of results rest on the claim that each RealBench example 'includes both natural language requirements and UML diagrams as system design, matching how developers typically receive specifications' (abstract). However, no section describes the sourcing or authoring of the UML diagrams, provides expert validation, developer surveys, or comparisons to real artifacts such as Jira tickets, Confluence pages, or UML present in open-source repositories. Without this, the external validity of the reported performance gaps, module quality issues, and strategy preferences cannot be established.

    Authors: We agree that a more explicit description of the UML diagram construction process is needed to support the alignment claim. In the revised manuscript, we will add a dedicated subsection under 'Benchmark Construction' that details the authoring process: the UML diagrams were created manually by the authors using standard UML notation (class, sequence, and component diagrams) to represent typical system architectures drawn from publicly documented open-source repository structures and general industry software design guidelines. We will also acknowledge the absence of formal developer surveys or direct artifact comparisons (due to the proprietary nature of many enterprise specifications) and discuss this as a limitation, while citing supporting literature on how structured designs are used in practice. This revision will clarify the methodology without overstating external validity. revision: yes

  2. Referee: [Experimental Results section] The experimental results section summarizes high-level findings (e.g., 'much worse performance', 'significant performance gaps', 'poor quality due to grammar and logic errors') but the abstract and available description omit key details including dataset size, number of examples per repository, exact LLMs and metrics used, error rate breakdowns, or statistical tests. This makes it difficult to assess the strength of the three directional claims.

    Authors: We appreciate the referee's point on reporting completeness. While the full Experimental Setup and Results sections contain the underlying data, we agree the high-level summaries and abstract could be more precise. In the revision, we will expand the Experimental Results section to explicitly report: the total number of repositories and examples in RealBench, the exact LLMs evaluated along with prompting configurations, the full set of metrics (including any code quality and error analysis measures), quantitative error rate breakdowns by type (grammar, logic, etc.), and any statistical tests used to support the directional claims. The abstract will be updated to reference these elements at a high level. These changes will make the evidence for the three findings more transparent and verifiable. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark proposal with no derivation chain or circular reductions

full rationale

The paper proposes RealBench as a new repo-level benchmark that augments natural-language requirements with UML diagrams, then reports direct experimental observations of LLM performance on the resulting examples. No equations, fitted parameters, predictions, or mathematical derivations appear in the abstract or described structure. Claims about performance gaps, module-finding ability, and generation strategies rest on observed outputs rather than any self-referential definition or self-citation load-bearing step. The asserted real-world alignment is an input premise, not a derived result that loops back to the benchmark construction itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution is a new benchmark rather than a derivation, so the ledger contains only domain assumptions about real-world development practices and no free parameters or invented entities.

axioms (1)
  • domain assumption Developers in enterprise and team settings typically receive specifications as structured designs such as UML diagrams rather than raw natural language descriptions alone.
    This premise is invoked to justify why existing benchmarks are misaligned and why adding UML improves realism.

pith-pipeline@v0.9.0 · 5618 in / 1448 out tokens · 24549 ms · 2026-05-08T11:21:06.472308+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 24 canonical work pages · 8 internal anchors

  1. [1]

    2024. GPT-4o. https://openai.com/index/hello-gpt-4o/

  2. [2]

    2024. Qwen2.5-Coder-7B. https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct

  3. [3]

    2025. Claude Sonnet 4. https://www.anthropic.com/claude/sonnet

  4. [4]

    2025. Gemini-2.5-flash. https://ai.google.dev/gemini-api/docs/models?hl=zh-cn#gemini-2.5-flash

  5. [5]

    2025. Qwen3-235B-A22B. https://huggingface.co/Qwen/Qwen3-235B-A22B

  6. [6]

    Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. 2022. Multi-lingual evaluation of code generation models. arXiv preprint arXiv:2210.14868 (2022)

  7. [7]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021)

  8. [8]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al

  9. [9]

    Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

  10. [10]

    Tiobe Software BV. 2025. TIOBE Index. https://www.tiobe.com/tiobe-index/

  11. [11]

    Jialun Cao, Zhiyong Chen, Jiarong Wu, Shing-Chi Cheung, and Chang Xu. 2024. Javabench: A benchmark of object-oriented code generation for evaluating large language models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 870–882

  12. [12]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

  13. [13]

    Wei Cheng, Yuhan Wu, and Wei Hu. 2024. Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 7957–7977

  14. [14]

    Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2023. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion

  15. [15]

    Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv preprint arXiv:2308.01861 (2023)

  16. [16]

    Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. 2023. Longcoder: A long-range pre-trained language model for code completion. In International Conference on Machine Learning. PMLR, 12098–12107

  17. [17]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024)

  18. [18]

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938 (2021)

  19. [19]

    Rasha Ahmad Husein, Hala Aburajouh, and Cagatay Catal. 2025. Large language models for code completion: A systematic literature review. Computer Standards & Interfaces 92 (2025), 103917

  20. [20]

    Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping language to code in programmatic context. arXiv preprint arXiv:1808.09588 (2018)

  21. [21]

    Maliheh Izadi, Roberta Gismondi, and Georgios Gousios. 2022. Codefill: Multi-token code completion by jointly learning from structure and naming sequences. In Proceedings of the 44th international conference on software engineering. 401–412

  22. [22]

    Maliheh Izadi, Jonathan Katzy, Tim Van Dam, Marc Otten, Razvan Mihai Popescu, and Arie Van Deursen. 2024. Language models for code completion: A practical evaluation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

  23. [23]

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515 (2024)

  24. [24]

    Siyuan Jiang, Jia Li, He Zong, Huanyu Liu, Hao Zhu, Shukai Hu, Erlu Li, Jiazheng Ding, Yu Han, Wei Ning, et al. 2024. aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing. arXiv preprint arXiv:2410.13187 (2024)

  25. [25]

    Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning. PMLR, 18319–18345

  26. [26]

    Craig Larman. 2012. Applying UML and patterns: an introduction to object-oriented analysis and design and iterative development. Pearson Education India

  27. [27]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33 (2020), 9459–9474

  28. [28]

    Jia Li, Xuyuan Guo, Lei Li, Kechi Zhang, Ge Li, Zhengwei Tao, Fang Liu, Chongyang Tao, Yuqi Zhu, and Zhi Jin

  29. [29]

    LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding. arXiv preprint arXiv:2503.04359 (2025)

  30. [30]

    Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, and Zhi Jin. 2024. Evocodebench: An evolving code generation benchmark aligned with real-world code repositories. arXiv preprint arXiv:2404.00599 (2024)

  31. [31]

    Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Zhi Jin, Hao Zhu, Huanyu Liu, Kaibo Liu, Lecheng Wang, Zheng Fang, et al

  32. [32]

    Deveval: Evaluating code generation in practical software projects. arXiv preprint arXiv:2401.06401 (2024)

  33. [33]

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode. Science 378, 6624 (2022), 1092–1097

  34. [34]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)

  35. [35]

    Wei Liu, Ailun Yu, Daoguang Zan, Bo Shen, Wei Zhang, Haiyan Zhao, Zhi Jin, and Qianxiang Wang. 2024. Graphcoder: Enhancing repository-level code completion via code context graph-based retrieval and language model. arXiv preprint arXiv:2406.07003 (2024)

  36. [36]

    Agile Manifesto. 2001. Manifesto for agile software development

  37. [37]

    Phuong T Nguyen, Juri Di Rocco, Davide Di Ruscio, Lina Ochoa, Thomas Degueule, and Massimiliano Di Penta. 2019. Focus: A recommender system for mining api function calls and usage patterns. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 1050–1060

  38. [38]

    Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022)

  39. [39]

    Object Management Group. 2017. Unified Modeling Language (UML) Version 2.5.1. Specification. Object Management Group. https://www.omg.org/spec/UML/2.5.1/

  40. [40]

    R.S. Pressman and B.R. Maxim. 2019. Software Engineering: A Practitioner’s Approach. McGraw-Hill Education

  41. [41]

    Winston W Royce. 1987. Managing the development of large software systems: concepts and techniques. In Proceedings of the 9th international conference on Software Engineering. 328–338

  42. [42]

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adri Garriga-Alonso, et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on machine learning research (2023)

  43. [43]

    Shuai Wang, Liang Ding, Li Shen, Yong Luo, Bo Du, and Dacheng Tao. 2024. Oop: Object-oriented programming evaluation benchmark for large language models. arXiv preprint arXiv:2401.06628 (2024)

  44. [44]

    Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. 2022. Execution-based evaluation for open-domain code generation. arXiv preprint arXiv:2212.10481 (2022)

  45. [45]

    Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro Von Werra, Arjun Guha, and Lingming Zhang. 2024. Selfcodealign: Self-alignment for code generation. arXiv preprint arXiv:2410.24198 (2024)

  46. [46]

    Xi Ye, Fangcong Yin, Yinghui He, Joie Zhang, Howard Yen, Tianyu Gao, Greg Durrett, and Danqi Chen. 2025. Longproc: Benchmarking long-context language models on long procedural generation. arXiv preprint arXiv:2501.05414 (2025)

  47. [47]

    Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the 15th international conference on mining software repositories. 476–486

  48. [48]

    Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12

  49. [49]

    Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou

  50. [50]

    CERT: continual pre-training on sketches for library-oriented code generation. arXiv preprint arXiv:2206.06888 (2022)

  51. [51]

    Daoguang Zan, Ailun Yu, Wei Liu, Dong Chen, Bo Shen, Wei Li, Yafen Yao, Yongshun Gong, Xiaolin Chen, Bei Guan, et al. 2024. Codes: Natural language to code repository via multi-layer sketch. arXiv preprint arXiv:2403.16443 (2024)

  52. [52]

    Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. Repocoder: Repository-level code completion through iterative retrieval and generation. arXiv preprint arXiv:2303.12570 (2023)

  53. [53]

    Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. 2024. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. arXiv preprint arXiv:2401.07339 (2024)

  54. [54]

    Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, et al. 2023. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5673–5684