CodeTeam: An LLM-Powered Multi-Agent Framework for Repository-Level Code Generation

Arif Ali Khan; Mojtaba Shahin; Peng Liang; Qiong Feng; Ruiyin Li; Yifei Wang; Zengyang Li

arxiv: 2606.22082 · v1 · pith:V62T2DMQnew · submitted 2026-06-20 · 💻 cs.SE · cs.AI


Yifei Wang, Ruiyin Li, Peng Liang, Qiong Feng, Zengyang Li, Mojtaba Shahin, Arif Ali Khan

Pith reviewed 2026-06-26 11:35 UTC · model grok-4.3


keywords: multi-agent framework · repository-level code generation · NL2Repo · software design sketches · LLM coordination · code generation benchmarks · test-driven repair


The pith

A multi-agent LLM framework divides repository code generation into distinct planning, selection, and implementation stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CodeTeam to tackle natural-language-to-repository generation (NL2Repo), a task that requires long planning horizons, stable interfaces across files, and repair of cross-file inconsistencies. It assigns separate roles: multiple Architect agents produce competing design sketches, a CTO agent selects one and formalizes it into a machine-checkable contract, Developer agents implement code under a dependency-aware scheduler with Git-based coordination, and a QA agent drives test-based repairs. On SketchEval, CodeTeam improves SketchBLEU over CodeS by 4.1 absolute points in the prompt-engineering setting and 2.9 in the fine-tuning setting; on NL2Repo-Bench it reports the highest average test pass rates, 34.6 percent and 42.3 percent in the same two settings. A reader would care because the method offers a concrete way to scale code generation beyond single functions while maintaining functional correctness.

Core claim

CodeTeam is an LLM-based multi-agent framework that separates planning, decision making, and implementation into distinct, coordinated stages. In the planning stage, multiple Architect agents draft competing software design sketches optionally grounded by retrieved design references. A CTO agent evaluates, selects, and normalizes the most promising sketch into a machine-checkable contract that specifies file ownership, public interfaces, and dependency constraints. In the implementation stage, Developer agents generate code under a dependency-aware scheduler with bounded context and lightweight Git-based coordination, while a QA agent runs tests and drives iterative repairs. On SketchEval th
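To make the idea of a machine-checkable contract concrete, here is a minimal sketch in Python. It is not the paper's schema: the `FileSpec` fields, the `check_contract` function, and the example file names are all assumptions chosen for illustration. It only shows how file ownership, public interfaces, and dependency constraints could be stored as data and validated before any code is generated.

```python
from dataclasses import dataclass, field


@dataclass
class FileSpec:
    """One file in the contract (all field names are illustrative)."""
    path: str
    owner: str                      # Developer agent responsible for the file
    public_interface: list[str]     # names the file promises to export
    depends_on: list[str] = field(default_factory=list)  # paths it imports from


def check_contract(files: list[FileSpec]) -> list[str]:
    """Return a list of violations; an empty list means the contract is consistent."""
    problems = []
    paths = [f.path for f in files]
    # Each file must appear exactly once so ownership is unambiguous.
    for path in sorted(set(paths)):
        if paths.count(path) > 1:
            problems.append(f"duplicate file entry: {path}")
    known = set(paths)
    for f in files:
        if not f.owner:
            problems.append(f"{f.path}: no owner assigned")
        for dep in f.depends_on:
            if dep == f.path:
                problems.append(f"{f.path}: depends on itself")
            elif dep not in known:
                problems.append(f"{f.path}: depends on undeclared file {dep}")
    return problems


# Hypothetical three-file contract with one dangling dependency.
contract = [
    FileSpec("models.py", "dev_a", ["User"]),
    FileSpec("service.py", "dev_b", ["create_user"], depends_on=["models.py"]),
    FileSpec("cli.py", "dev_b", ["main"], depends_on=["service.py", "utils.py"]),
]
print(check_contract(contract))  # ['cli.py: depends on undeclared file utils.py']
```

A check of this kind is cheap to run after every planning step, which is one plausible reason to normalize a free-form sketch into structured data first.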

What carries the argument

The multi-agent coordination structure that assigns Architect agents to draft software design sketches, a CTO agent to produce a machine-checkable contract, Developer agents to implement under a dependency-aware scheduler, and a QA agent to perform test-driven repairs.
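One way to picture a dependency-aware scheduler is as a topological ordering over the contract's dependency edges. The sketch below uses Python's standard `graphlib` module to group files into waves that could be implemented in parallel. The function name `schedule_waves` and the example graph are assumptions; the paper's actual scheduler may work differently.

```python
from graphlib import CycleError, TopologicalSorter


def schedule_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group files into waves; each file depends only on files in earlier waves.

    `deps` maps a file path to the set of paths it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()                          # raises CycleError on circular deps
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())    # files whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)                   # mark the whole wave as implemented
    return waves


# Hypothetical dependency edges taken from a contract.
deps = {
    "models.py": set(),
    "utils.py": set(),
    "service.py": {"models.py"},
    "cli.py": {"service.py", "utils.py"},
}
print(schedule_waves(deps))
# [['models.py', 'utils.py'], ['service.py'], ['cli.py']]
```

Files in the same wave share no dependency edge, so separate Developer agents could work on them concurrently, each seeing only the interfaces of earlier waves.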

If this is right

Project-specific developer allocation contributes 9.9 percent relative improvement to SketchBLEU.
Retrieval-augmented planning contributes 8.1 percent relative improvement to SketchBLEU.
Sketch-level improvements translate directly to higher functional correctness on execution-based test suites.
The staged coordination reduces cross-file inconsistencies during iterative debugging.
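The headline results are stated in absolute points while the ablation results are stated as relative percentages, so the two kinds of number are not directly comparable. The short sketch below shows the difference. The scores in it are hypothetical and are not taken from the paper.

```python
def absolute_gain(baseline: float, improved: float) -> float:
    """Difference in metric points (for example SketchBLEU points)."""
    return improved - baseline


def relative_gain(baseline: float, improved: float) -> float:
    """Improvement expressed as a percentage of the baseline score."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return 100.0 * (improved - baseline) / baseline


# Hypothetical scores, chosen only to make the arithmetic easy to follow.
without_component = 40.0
with_component = 44.0
print(absolute_gain(without_component, with_component))  # 4.0
print(relative_gain(without_component, with_component))  # 10.0
```

The same absolute gain yields a larger relative gain when the baseline is low, which is worth keeping in mind when reading the 9.9 percent and 8.1 percent figures.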

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The pattern of competing design sketches followed by contract normalization could apply to other multi-component generation tasks that need consistency guarantees.
Lightweight Git-based coordination between agents may scale to larger repositories if the contract remains stable.
If the QA-driven repair loop generalizes, similar agent separation might reduce error accumulation in long-horizon planning problems outside code generation.

Load-bearing premise

The reported gains come from the multi-agent roles and coordination mechanisms rather than from differences in the base LLM or prompt details relative to the CodeS baselines.

What would settle it

Re-running the CodeS baselines with identical base models, prompts, and implementation details as CodeTeam and finding no performance difference on SketchEval or NL2Repo-Bench.
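A matched re-run starts with confirming that the two systems really share the controlled settings. The sketch below compares two run configurations on a fixed set of keys and reports any that differ. The key names, the `config_mismatches` helper, and every value in the example are placeholders, not details from the paper or its replication package.

```python
# Settings that should be identical across the systems being compared
# (an illustrative list, not the paper's).
CONTROLLED_KEYS = ("base_model", "temperature", "retrieval_corpus", "finetune_corpus")


def config_mismatches(a: dict, b: dict, keys=CONTROLLED_KEYS) -> dict:
    """Return {key: (a_value, b_value)} for every controlled key that differs."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Placeholder configurations for two runs.
run_a = {"base_model": "model-x", "temperature": 0.7,
         "retrieval_corpus": "corpus-v1", "finetune_corpus": None}
run_b = {"base_model": "model-x", "temperature": 0.2,
         "retrieval_corpus": "corpus-v1", "finetune_corpus": None}

print(config_mismatches(run_a, run_b))  # {'temperature': (0.7, 0.2)}
```

An empty result would indicate that the comparison is controlled on these keys; it says nothing about settings that were never recorded.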

Figures

Figures reproduced from arXiv: 2606.22082 by Arif Ali Khan, Mojtaba Shahin, Peng Liang, Qiong Feng, Ruiyin Li, Yifei Wang, Zengyang Li.

**Figure 1.** CodeTeam workflow from requirements preprocessing (0) to Architect planning (❶), CTO selection (❷), solution materialization and Git-coordinated implementation (❸–❹), QA-driven repair (❺–❻), and final repository output (❼).

… fixed repository file hierarchy, each file’s responsibility and public interface, and explicit dependency edges among files. Concretely, an SDS specifies: (1) the technology stack, in…

**Figure 2.** Overall SketchBLEU on SketchEval (all 19 tasks).

4.1.2 SketchBLEU Sub-score Decomposition. Moving from PE to SFT, the n-gram (B.) and weighted n-gram (B.W.) sub-scores of CodeS improve by 12.3 and 13.0 absolute points, respectively, whereas the structural (M.S.) and dataflow (M.D.) sub-scores improve by 9.2 and 7.9 points. This asymmetry suggests that fine-tuning primarily helps the backbone model reproduc…

**Figure 3.** Repository size alignment on SketchEval (PE setting). Values are mean absolute errors against reference repositories.

CodeTeam exhibits a more pronounced advantage on the structural (M.S.) and dataflow (M.D.) components, which are specifically designed to capture repository-level integrity and semantic consistency beyond surface-level token overlap. This finding aligns with the design rationale of CodeTeam…

**Figure 4.** RAG improves planning-stage quality signals (PE setting).

**Figure 5.** Dynamic developer allocation improves convergence and reduces cross-file mismatch indicators (PE…

Original abstract

Natural language to repository generation (NL2Repo) requires a system to construct an entire software repository from a natural-language requirements document. Compared with function-level code generation, this task demands longer planning horizons, stable interfaces across files, and iterative debugging of cross-file inconsistencies. To address these challenges, we propose CodeTeam, an LLM-based multi-agent framework that separates planning, decision making, and implementation into distinct, coordinated stages. In the planning stage, multiple Architect agents draft competing software design sketches (SDS), optionally grounded by retrieved design references. A CTO agent then evaluates, selects, and normalizes the most promising SDS into a machine-checkable contract that specifies file ownership, public interfaces, and dependency constraints. In the implementation stage, Developer agents generate code under a dependency-aware scheduler with bounded context and lightweight Git-based coordination, while a QA agent runs tests and drives iterative repairs. On the synthesis-based SketchEval benchmark, we explicitly compare CodeTeam's prompt-engineering (PE) and supervised fine-tuning (SFT) variants with the corresponding CodeS variants, where CodeTeam improves the overall SketchBLEU by 4.1 and 2.9 absolute points, respectively. On the execution-based NL2Repo-Bench benchmark, used as an external validation protocol, CodeTeam achieves the highest average test pass rate in both settings (34.6% PE, 42.3% SFT), confirming that the sketch-improvements extend to functional correctness under upstream test suites. Ablation results show that project-specific developer allocation and retrieval-augmented planning each contribute substantially to the SketchBLEU improvement (9.9% and 8.1% relative, respectively). CodeTeam and the experimental results are available at https://github.com/WhitenWhiten/CodeTeam
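The test-and-repair cycle described in the abstract can be pictured as a bounded loop: run the tests, hand the failures to a repair step, and stop when the suite passes or the budget runs out. The sketch below is a toy version under stated assumptions. `repair_loop`, `run_tests`, and `propose_fix` are hypothetical names, and the string replacement in `propose_fix` merely stands in for an LLM-generated patch.

```python
from typing import Callable

Repo = dict[str, str]  # file path -> source text


def repair_loop(repo: Repo,
                run_tests: Callable[[Repo], list[str]],
                propose_fix: Callable[[Repo, list[str]], Repo],
                max_rounds: int = 3) -> tuple[Repo, list[str]]:
    """Alternate testing and repair until tests pass or the budget is spent."""
    failures = run_tests(repo)
    for _ in range(max_rounds):
        if not failures:
            break
        repo = propose_fix(repo, failures)   # an LLM call in a real system
        failures = run_tests(repo)
    return repo, failures


def run_tests(repo: Repo) -> list[str]:
    """Toy test suite: executes add.py and checks one behavior."""
    namespace: dict = {}
    exec(repo["add.py"], namespace)
    return [] if namespace["add"](2, 3) == 5 else ["test_add: expected 5"]


def propose_fix(repo: Repo, failures: list[str]) -> Repo:
    """Toy repair step standing in for a model-generated patch."""
    return {**repo, "add.py": repo["add.py"].replace("a - b", "a + b")}


buggy = {"add.py": "def add(a, b):\n    return a - b\n"}
fixed, remaining = repair_loop(buggy, run_tests, propose_fix)
print(remaining)  # []
```

Capping the number of rounds keeps cost predictable, and returning the remaining failures lets the caller decide whether a partially repaired repository is still worth keeping.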

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

CodeTeam reports concrete benchmark lifts from its multi-agent workflow, but the attribution to coordination mechanisms rests on unverified baseline equivalence with CodeS.


The main takeaway is that CodeTeam posts measurable gains on SketchEval (4.1 and 2.9 SketchBLEU) and leads NL2Repo-Bench pass rates (34.6% PE, 42.3% SFT), yet those numbers cannot be cleanly credited to the competing architects, CTO contracts, or Git scheduler until the CodeS baselines are shown to match on model, prompts, and retrieval.

What is actually new is the specific pipeline: multiple Architect agents produce competing SDS, a CTO normalizes the winner into machine-checkable contracts for file ownership and interfaces, then a dependency-aware scheduler assigns work to Developers who coordinate lightly via Git, with a QA loop for repairs. The ablations quantify two pieces—project-specific allocation and retrieval-augmented planning—at 9.9% and 8.1% relative SketchBLEU lift. Releasing the code and running both a synthesis benchmark and an execution benchmark is useful.

The soft spots sit in the comparison details. The abstract calls the CodeS runs “corresponding variants” in PE and SFT regimes, but supplies no confirmation that the base LLM, temperature, context window, or fine-tuning data were identical. Without that, the deltas could trace to prompt engineering or data differences rather than the multi-agent structure. No error bars, significance tests, or exclusion rules appear in the reported numbers. The circularity burden is low because the results are empirical rather than fitted equations.
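Error bars for a comparison like this can be obtained from per-task scores with a paired percentile bootstrap, which resamples the per-task differences between the two systems. The sketch below is one simple variant, offered as an illustration rather than as the analysis the authors ran. The function name, defaults, and scores are assumptions.

```python
import random


def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-task differences a[i] - b[i].

    Returns (observed_mean_difference, lower_bound, upper_bound).
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    rng = random.Random(seed)                # fixed seed for repeatability
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Hypothetical per-task scores for two systems on the same five tasks.
system_a = [52.0, 47.5, 60.0, 55.0, 49.0]
system_b = [50.0, 48.0, 55.5, 51.0, 47.0]
mean_diff, low, high = paired_bootstrap_ci(system_a, system_b)
print(f"mean difference {mean_diff:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

If the interval excludes zero, the difference is unlikely to be an artifact of which tasks happened to be in the benchmark. With only 19 tasks, as on SketchEval, such intervals may be wide.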

This paper is for researchers in AI-assisted software engineering who already work on repo-level generation and want a concrete workflow plus ablations to build on. A reader who needs to reproduce the exact setup or test the coordination claims will get value once the baseline controls are tightened.

It deserves peer review. The experiments are grounded enough and the framework is described at a level that can be checked against the released code, even if the current write-up needs more explicit matching on the baselines.

Referee Report

2 major / 2 minor

Summary. The paper proposes CodeTeam, an LLM-based multi-agent framework for natural language to repository-level code generation (NL2Repo). It divides the task into a planning stage (multiple competing Architect agents producing Software Design Sketches optionally augmented by retrieval, followed by a CTO agent that selects and normalizes the best sketch into a machine-checkable contract specifying file ownership, interfaces, and dependencies) and an implementation stage (Developer agents generating code under a dependency-aware scheduler with Git coordination, plus a QA agent performing test-driven iterative repairs). On the SketchEval benchmark the PE and SFT variants are reported to improve overall SketchBLEU by 4.1 and 2.9 absolute points over corresponding CodeS variants; on the external NL2Repo-Bench they achieve the highest average test pass rates (34.6 % PE, 42.3 % SFT). Ablations attribute 9.9 % and 8.1 % relative SketchBLEU gains to project-specific developer allocation and retrieval-augmented planning, respectively. Code and results are released.
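The selection step in the summary above, where competing sketches are compared and one is chosen, can be illustrated with a simple scoring function. The rubric below (requirement coverage minus a penalty for dependencies on undeclared files) is invented for this example and is not the CTO agent's actual criterion. All names and data are hypothetical.

```python
def score_sketch(sketch: dict, required_features: set[str]) -> float:
    """Illustrative rubric: feature coverage minus 0.1 per dangling dependency."""
    covered = required_features & set(sketch["features"])
    coverage = len(covered) / len(required_features)
    declared = set(sketch["files"])
    dangling = sum(1 for deps in sketch["files"].values()
                   for dep in deps if dep not in declared)
    return coverage - 0.1 * dangling


def select_sketch(sketches: list[dict], required_features: set[str]) -> dict:
    """Pick the highest-scoring sketch; ties go to the earliest candidate."""
    return max(sketches, key=lambda s: score_sketch(s, required_features))


required = {"auth", "storage", "cli"}
candidates = [
    {   # covers every feature but references db.py without declaring it
        "architect": "architect_1",
        "features": ["auth", "storage", "cli"],
        "files": {"cli.py": ["core.py"], "core.py": ["db.py"]},
    },
    {   # internally consistent but misses the cli feature
        "architect": "architect_2",
        "features": ["auth", "storage"],
        "files": {"core.py": ["db.py"], "db.py": []},
    },
]
print(select_sketch(candidates, required)["architect"])  # architect_1
```

In this toy case the winning sketch still has a dangling dependency, which shows why a separate normalization step into a checked contract is useful after selection.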

Significance. If the numerical gains can be shown to arise from the multi-agent coordination mechanisms rather than unmatched base models, prompts, or implementation details, the work would provide concrete evidence that a structured division of labor among agents (competing architects, contract normalization, dependency scheduling, Git coordination, QA loop) improves long-horizon repository generation over single-agent or simpler baselines. The explicit PE/SFT comparison protocol and public release of code strengthen the empirical contribution.

major comments (2)

[Abstract] Abstract: the headline claim that CodeTeam improves SketchBLEU by 4.1/2.9 points and reaches the highest NL2Repo-Bench pass rates (34.6 %/42.3 %) over 'corresponding CodeS variants' in both PE and SFT regimes is load-bearing for the central thesis, yet the manuscript supplies no evidence that the CodeS baselines used identical base LLMs, temperature, retrieval setup, or fine-tuning corpus; without such controls the deltas cannot be attributed to the multi-agent pipeline (competing Architects, CTO contract, dependency scheduler, Git coordination, QA loop).
[Abstract] Abstract (ablations paragraph): the reported 9.9 % and 8.1 % relative contributions from 'project-specific developer allocation' and 'retrieval-augmented planning' are presented as supporting the architectural mechanisms, but the same baseline-equivalence issue applies; the ablations must be shown to hold the base LLM and prompt engineering fixed.

minor comments (2)

[Abstract] The abstract states that results are available at a GitHub link; the manuscript should also include a brief description of the exact models, temperatures, and retrieval corpora used for both CodeTeam and CodeS variants so that readers can verify equivalence without external inspection.
[Abstract] Notation: 'Software Design Sketch (SDS)' and 'CTO agent' are introduced without prior definition in the abstract; a short parenthetical gloss on first use would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your detailed and constructive review. We appreciate the focus on ensuring that performance gains can be attributed to the proposed multi-agent mechanisms. We address each major comment below.


Referee: [Abstract] Abstract: the headline claim that CodeTeam improves SketchBLEU by 4.1/2.9 points and reaches the highest NL2Repo-Bench pass rates (34.6 %/42.3 %) over 'corresponding CodeS variants' in both PE and SFT regimes is load-bearing for the central thesis, yet the manuscript supplies no evidence that the CodeS baselines used identical base LLMs, temperature, retrieval setup, or fine-tuning corpus; without such controls the deltas cannot be attributed to the multi-agent pipeline (competing Architects, CTO contract, dependency scheduler, Git coordination, QA loop).

Authors: We agree that baseline equivalence is essential for attributing gains to the multi-agent design. The CodeS variants were implemented using identical base LLMs, temperature (0.7), retrieval setup, and fine-tuning corpus as the corresponding CodeTeam variants; this matching protocol is described in Section 4 (Experimental Setup) of the full manuscript. To address the concern directly in the abstract, we will revise the abstract to briefly note the matched conditions.

Revision: yes

Referee: [Abstract] Abstract (ablations paragraph): the reported 9.9 % and 8.1 % relative contributions from 'project-specific developer allocation' and 'retrieval-augmented planning' are presented as supporting the architectural mechanisms, but the same baseline-equivalence issue applies; the ablations must be shown to hold the base LLM and prompt engineering fixed.

Authors: The ablations isolate the contributions of project-specific developer allocation and retrieval-augmented planning by removing or modifying only those components while holding the base LLM, temperature, prompt templates, and retrieval configuration fixed. This design ensures the reported relative gains are due to the ablated mechanisms. We will revise the abstract to explicitly state that the ablations maintain fixed base model and prompt settings.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark comparisons only


The paper reports measured improvements on SketchEval (SketchBLEU deltas) and NL2Repo-Bench (pass rates) plus ablation percentages. No equations, fitted parameters, predictions, or first-principles derivations exist that could reduce to inputs by construction. Comparisons to CodeS variants and ablations are presented as external evidence; no self-citation chain or self-definitional step carries the central claim. This is standard empirical SE work with independent benchmark grounding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Framework rests on domain assumptions about LLM role-playing effectiveness and benchmark representativeness; introduces new agent roles and SDS concept without external independent evidence beyond the reported runs.

axioms (2)

Domain assumption: LLMs can reliably perform specialized software engineering roles (architect, CTO, developer, QA) when given structured prompts and contracts.
Central to the separation of planning and implementation stages described in the abstract.

Domain assumption: The SketchEval and NL2Repo-Bench benchmarks are appropriate proxies for real repository-level generation tasks.
Used to claim superiority over CodeS baselines.

invented entities (2)

Software Design Sketch (SDS) · no independent evidence
Purpose: Competing design drafts produced by Architect agents.
New intermediate artifact introduced for the planning stage.

CTO agent · no independent evidence
Purpose: Evaluates sketches and produces a normalized machine-checkable contract.
Invented decision-making role not present in baseline CodeS.




[1] [1]

Barry Boehm and Victor R. Basili. 2001. Software Defect Reduction Top 10 List.IEEE Computer34, 1 (2001), 135–137

2001

[2] [2]

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi- Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation.arXiv preprint arXiv:2402.03216(2024)

Pith/arXiv arXiv 2024

[3] [3]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code.arXiv preprint arXiv:2107.03374(2021)

Pith/arXiv arXiv 2021

[4] [4]

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2024. Teaching Large Language Models to Self-Debug. InProceedings of the 12th International Conference on Learning Representations (ICLR). OpenReview.net, 1–23. ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 0. Publication date: 2026. 0:34 Wang et al

2024

[5] [5]

Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. 2024. LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models. InProceedings of the 12th International Conference on Learning Representations (ICLR). OpenReview.net, 1–19

2024

[6] Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv preprint arXiv:2307.08691 (2023).

[7] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. In Proceedings of the 37th Annual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS Proceedings, 10088–10115.

[8] Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, Weiran Shi, Zaiyuan Wang, Daoguang Zan, Chenchen Zhang, Xiaoxu Zhang, Qizhi Chen, Xianfu Cheng, Bo Deng, Qingshui Gu, Kai Hua, Juntao Lin, Pai Liu, Mingchen Li, Xuanguang Pan, Zifan Peng, Yujia Qin, Yong Shan, Zhewen Tan, Weihao Xie, Zihan Wa...

[9] Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. 2023. CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion. In Proceedings of the 37th Annual Conference on Neural Information Processing Systems (NeurIPS). NeurI...

[10] Matthijs Douze, Anton Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss Library. arXiv preprint arXiv:2401.08281 (2024).

[11] Bradley Efron. 1979. Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics 7, 1 (1979), 1–26.

[12] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. InCoder: A Generative Model for Code Infilling and Synthesis. arXiv preprint arXiv:2204.05999 (2022).

[13] Significant Gravitas. 2023. AutoGPT. https://github.com/Significant-Gravitas/AutoGPT

[14] Mary W. Hall and Ken Kennedy. 1992. Efficient Call Graph Analysis. ACM Letters on Programming Languages and Systems 1, 3 (1992), 227–242.

[15] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring Coding Challenge Competence With APPS. arXiv preprint arXiv:2105.09938 (2021).

[16] Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In Proceedings of the 12th International Conference on Learning Repre...

[17] Md. Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. 2024. MapCoder: Multi-Agent Code Generation for Competitive Problem Solving. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). ACL, 4917–4942.

[18] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Programmatic Context. In Proceedings of the 23rd Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 1643–1652.

[19] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In Proceedings of the 12th International Conference on Learning Representations (ICLR). OpenReview.net, 1–51.

[20] Syed Mohammad Kashif, Ruiyin Li, Peng Liang, Amjed Tahir, Qiong Feng, Zengyang Li, and Mojtaba Shahin. 2026. Beyond Functional Correctness: Design Issues in AI IDE-Generated Large-Scale Projects. arXiv preprint arXiv:2604.06373 (2026).

[21] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv preprint arXiv:2005.11401 (2020).

[22] Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, and Scarlett Li. 2025. FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL). ACL, 17160–17176.

[23] Tianyang Liu, Canwen Xu, and Julian McAuley. 2024. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. In Proceedings of the 12th International Conference on Learning Representations (ICLR). OpenReview.net, 1–19.

[24] Wei Liu, Ailun Yu, Daoguang Zan, Bo Shen, Wei Zhang, Haiyan Zhao, Zhi Jin, and Qianxiang Wang. 2024. GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model. arXiv preprint arXiv:2406.07003 (2024).

[25] Yu. A. Malkov and D. A. Yashunin. 2016. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. arXiv preprint arXiv:1603.09320 (2016).

[26] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. arXiv preprint arXiv:2203.13474 (2022).

[28] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). ACL, 311–318.

[29] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ChatDev: Communicative Agents for Software Development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). ACL, 15174–15186.

[30] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. arXiv preprint arXiv:1910.02054 (2020).

[31] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv preprint arXiv:2009.10297 (2020).

[32] Tal Ridnik, Dedy Kredo, and Itamar Friedman. 2024. Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering. arXiv preprint arXiv:2401.08500 (2024).

[33] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. In Proceedings of the 37th Annual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS Proceedings, 8634–8652.

[34] Qwen Team, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115 (2024).

[35] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform for A...

[36] Yifei Wang, Ruiyin Li, Peng Liang, Qiong Feng, Zengyang Li, Mojtaba Shahin, and Arif Ali Khan. 2026. Replication Package for the Paper: CodeTeam: An LLM-Powered Multi-Agent Framework for Repository-Level Code Generation. https://github.com/WhitenWhiten/CodeTeam

[37] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 26th Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 8696–8708.

[38] Frank Wilcoxon. 1945. Individual Comparisons by Ranking Methods. Biometrics Bulletin 1, 6 (1945), 80–83.

[39] Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. 2012. Experimentation in Software Engineering. Springer.

[40] Hui Yang, Sifu Yue, and Yunzhong He. 2023. Auto-GPT for Online Decision Making: Benchmarks and Additional Opinions. arXiv preprint arXiv:2306.02224 (2023).

[41] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In Proceedings of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS Proceedings, 50528–50652.

[42] Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR). ACM, 476–486.

[43] Daoguang Zan, Ailun Yu, Wei Liu, Dong Chen, Bo Shen, Yafen Yao, Wei Li, Xiaolin Chen, Yongshun Gong, Bei Guan, Zhiguang Yang, Yongji Wang, Lizhen Cui, and Qianxiang Wang. 2025. CodeS: Natural Language to Code Repository via Multi-Layer Sketch. ACM Transactions on Software Engineering and Methodology (2025).

[44] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Proceedings of the 28th Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2471–2484.

[46] Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. 2024. CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges. In Proceedings of the 62nd Annual Meeting of the Association for Computational L...

[47] Qianhui Zhao, Li Zhang, Fang Liu, Junhang Cheng, Chengru Wu, Junchen Ai, Qiaoyuanhe Meng, Lichen Zhang, Xiaoli Lian, Shubin Song, and Yuanping Guo. 2025. Towards Realistic Project-Level Code Generation via Multi-Agent Collaboration and Semantic Architecture Modeling. arXiv preprint arXiv:2511.03404 (2025).

[48] Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, and Graham Neubig. 2023. DocPrompting: Generating Code by Retrieving the Docs. In Proceedings of the 11th International Conference on Learning Representations (ICLR). OpenReview.net, 1–16.