arxiv: 2306.03091 · v2 · submitted 2023-06-05 · 💻 cs.CL · cs.AI · cs.SE

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Tianyang Liu, Canwen Xu, Julian McAuley

Pith reviewed 2026-05-15 22:26 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SE

keywords code auto-completion · repository-level · benchmark · large language models · Python · Java · retrieval · evaluation

The pith

RepoBench introduces a benchmark for repository-level code auto-completion with three tasks covering retrieval, next-line prediction, and combined pipelines in Python and Java.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current benchmarks for code auto-completion focus on single-file tasks and leave a gap for realistic multi-file scenarios common in development. The paper introduces RepoBench to close this gap through three interconnected tasks: one for retrieving relevant code snippets from other files, one for predicting the next line with cross-file and in-file context, and one for pipelines that combine retrieval and prediction. This structure tests systems on complex repository contexts rather than isolated files. A sympathetic reader would care because improved evaluation could drive more capable auto-completion tools that handle dependencies across large codebases.

Core claim

RepoBench is a benchmark for evaluating repository-level code auto-completion systems that supports Python and Java and consists of three tasks: RepoBench-R measures retrieval of the most relevant code snippets from other files, RepoBench-C measures next-line prediction using both cross-file and in-file context, and RepoBench-P measures complex tasks that require combining retrieval and next-line prediction.

What carries the argument

The three interconnected tasks RepoBench-R (retrieval), RepoBench-C (code completion), and RepoBench-P (pipeline) that measure use of cross-file context in realistic repository settings.
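
To make the division of labor between the tasks concrete, here is a minimal sketch of the retrieval step, assuming a simple token-overlap (Jaccard) ranker. The ranker, the helper names `tokens`, `jaccard`, and `retrieve`, and the toy snippets are illustrative assumptions and not the paper's implementation. Ranking candidate snippets from other files corresponds to the RepoBench-R setting; feeding the top-ranked snippet to a next-line predictor would correspond to the RepoBench-P setting.

```python
from __future__ import annotations

import re


def tokens(code: str) -> set[str]:
    """Split code into a set of identifier-like tokens."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two code strings (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query: str, candidates: list[str], k: int = 1) -> list[str]:
    """Return the k candidate snippets most similar to the query."""
    ranked = sorted(candidates, key=lambda c: jaccard(query, c), reverse=True)
    return ranked[:k]


# Hypothetical cross-file candidates and in-file context.
candidates = [
    "def load_config(path):\n    return parse_yaml(read_text(path))",
    "class Logger:\n    def warn(self, msg): ...",
]
in_file = "from settings import load_config\ncfg = load_config(path)"

top = retrieve(in_file, candidates, k=1)
```

A lexical ranker like this is only a baseline choice for illustration; an embedding-based retriever could be swapped in behind the same `retrieve` signature.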

Load-bearing premise

The three constructed tasks and data selection in RepoBench faithfully capture the challenges of real repository-level code completion without selection biases or artificial simplifications.

What would settle it

A finding that advanced models achieve similar performance on RepoBench as on single-file benchmarks, or that the tasks can be solved without genuine cross-file reasoning, would undermine the claim that RepoBench measures new repository-level capabilities.

Original abstract

Large Language Models (LLMs) have greatly advanced code auto-completion systems, with a potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench supports both Python and Java and consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and encouraging continuous improvement in auto-completion systems. RepoBench is publicly available at https://github.com/Leolty/repobench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RepoBench gives a practical new benchmark with three tasks for repo-level code completion, but the repo filtering and context choices lack checks that they hit genuinely hard cross-file cases.

The main contribution of this paper is the release of RepoBench, a benchmark with three linked tasks for Python and Java code completion: retrieval of cross-file snippets, next-line prediction using that context, and a pipeline that does both. That split is new relative to the single-file setups in prior work, and the public GitHub release makes it usable right away. The motivation section explains cleanly why file-level tests miss real developer workflows that span imports and multiple modules. The task definitions themselves are straightforward and give a clear way to measure progress on retrieval and generation together.

What feels thin is the data construction. The filtering by stars, size, and language, together with the import-based or line-based context extraction, could easily favor obvious dependencies over the messy, non-local ones that actually slow down real completion. Without numbers or examples showing that the chosen contexts require non-trivial cross-file reasoning, the claimed gap versus single-file baselines might shrink once people run harder repos.

This is for researchers building or benchmarking LLM coding assistants who want something beyond file-level tests. A reader running experiments on code models would find the task structure and dataset useful to try. It should go to peer review because the core idea and release are timely and concrete, even if the validation of the tasks needs tightening.

Referee Report

1 major / 2 minor

Summary. The paper claims that existing code auto-completion benchmarks are limited to single-file tasks and introduces RepoBench to address this gap. RepoBench is a new benchmark for repository-level code auto-completion supporting Python and Java, with three tasks: RepoBench-R (retrieval of relevant cross-file code snippets), RepoBench-C (next-line code prediction using cross-file and in-file context), and RepoBench-P (combined retrieval and prediction pipeline). The benchmark is made publicly available to enable better evaluation and improvement of auto-completion systems.

Significance. If the tasks and data construction accurately reflect real-world multi-file dependencies without substantial selection biases, RepoBench would provide a valuable standardized benchmark for evaluating repository-level code completion, helping to drive progress in LLM-based systems that handle cross-file context in practical developer scenarios.

major comments (1)

[§3] The repo filtering (stars, language, size) and context extraction methods (import-based retrieval for RepoBench-R, line prediction for RepoBench-C) are described without any quantitative validation, statistics, or analysis showing that the chosen cross-file contexts are non-trivial, representative, or free of simplifications relative to real repositories. This directly affects the central claim that the three tasks close the assessment gap for complex, real-world scenarios.

minor comments (2)

The abstract and introduction could more explicitly state the total number of repositories, files, and examples per task and language to convey the benchmark's scale.
Figure or table captions should clarify how the three tasks interconnect in the pipeline evaluation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the validation of our data construction methods.

Referee: [§3] The repo filtering (stars, language, size) and context extraction methods (import-based retrieval for RepoBench-R, line prediction for RepoBench-C) are described without any quantitative validation, statistics, or analysis showing that the chosen cross-file contexts are non-trivial, representative, or free of simplifications relative to real repositories. This directly affects the central claim that the three tasks close the assessment gap for complex, real-world scenarios.

Authors: We agree that the current description in §3 would benefit from quantitative validation to better support the claim of real-world relevance. In the revised manuscript, we will add a new subsection with statistics including: the distribution of repository sizes and star counts after filtering; the average number of cross-file imports per file in the sampled data; the distribution of context snippet lengths and dependency depths; and qualitative examples of multi-file interactions. These additions will demonstrate that the contexts are non-trivial and representative, directly addressing the concern about simplifications.

Revision: yes
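
One of the statistics proposed in the response, the number of cross-file imports per file, could be computed roughly as follows for Python sources. This is a sketch under stated assumptions: the function name `count_cross_file_imports` and the `repo_modules` set are hypothetical, and an import is treated as cross-file when its top-level module name matches a module known to live in the repository, or when it is a relative import. The paper's own extraction rules may differ.

```python
from __future__ import annotations

import ast


def count_cross_file_imports(source: str, repo_modules: set[str]) -> int:
    """Count imported modules that resolve to code inside the repository."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b, c` lists several modules; check each top-level name.
            count += sum(
                alias.name.split(".")[0] in repo_modules for alias in node.names
            )
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) are assumed to stay inside the repo.
            top_level = (node.module or "").split(".")[0]
            if node.level > 0 or top_level in repo_modules:
                count += 1
    return count


# Hypothetical file contents and repository module names.
sample = """
import os
import utils.io
from models import Encoder
from . import helpers
"""
n = count_cross_file_imports(sample, {"utils", "models"})
```

Averaging this count over all sampled files would give the per-file figure the authors mention; standard-library and third-party imports such as `os` are deliberately excluded.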

Circularity Check

0 steps flagged

No circularity: benchmark tasks are explicitly defined from scratch

The paper introduces RepoBench as a new benchmark with three explicitly constructed tasks (RepoBench-R for retrieval, RepoBench-C for next-line prediction with cross-file context, and RepoBench-P for combined pipeline). These are defined via repo filtering criteria, context extraction rules, and evaluation metrics that do not reduce to any fitted parameters, prior predictions, or self-citation chains. No equations or derivations are present that equate outputs to inputs by construction. The central claim (that the benchmark fills an assessment gap) rests on the novelty of the task definitions themselves, which are independent of any fitted results or uniqueness theorems from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The contribution rests on standard assumptions about the limitations of single-file benchmarks and the value of new evaluation tasks; no free parameters or invented physical entities are introduced.

axioms (2)

domain assumption: Single-file benchmarks leave an assessment gap for multi-file programming scenarios.
Core motivation stated in the abstract.
domain assumption: LLMs have advanced code auto-completion systems.
Background claim in the abstract.

invented entities (1)

RepoBench-R, RepoBench-C, RepoBench-P tasks · no independent evidence
purpose: To separately measure retrieval, code completion, and combined pipeline abilities at repository level.
Newly defined evaluation tasks introduced by the paper.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
cs.AI 2026-05 unverdicted novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis
cs.CL 2026-04 unverdicted novelty 8.0

InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
cs.CL 2023-08 unverdicted novelty 8.0

LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context
cs.SE 2026-05 accept novelty 7.0

Stale repository context in code RAG actively induces models to produce obsolete helper references, raising stale outputs by 76-88 percentage points over current-only retrieval in a 17-sample diagnostic study.
Toward Executable Repository-Level Code Generation via Environment Alignment
cs.SE 2026-04 unverdicted novelty 7.0

EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative alignment.
ABTest: Behavior-Driven Testing for AI Coding Agents
cs.SE 2026-04 unverdicted novelty 7.0

ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.
Story Point Estimation Using Large Language Models
cs.SE 2026-03 unverdicted novelty 7.0

LLMs predict story points better in zero-shot prompting than supervised deep learning models trained on 80% of project data, with few-shot examples and comparative judgments further improving performance.
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
cs.CL 2026-05 unverdicted novelty 6.0

LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
cs.SE 2026-04 unverdicted novelty 6.0

Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
cs.NI 2026-04 unverdicted novelty 6.0

SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...
When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation
cs.SE 2026-04 unverdicted novelty 6.0

LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
KV Cache Offloading for Context-Intensive Tasks
cs.LG 2026-04 unverdicted novelty 6.0

KV offloading hurts accuracy on context-heavy tasks because of low-rank key projections and bad landmarks, but a simpler strategy improves results across models and benchmarks.
KV Cache Offloading for Context-Intensive Tasks
cs.LG 2026-04 unverdicted novelty 6.0

KV offloading hurts accuracy on context-heavy tasks due to low-rank key projections and bad landmarks, but a simpler strategy recovers performance across models.
KV Cache Offloading for Context-Intensive Tasks
cs.LG 2026-04 unverdicted novelty 6.0

KV offloading degrades performance on context-intensive tasks due to low-rank key projections and unreliable landmarks, but a simpler alternative strategy restores accuracy across LLM families.
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
cs.CL 2026-04 unverdicted novelty 6.0

AsyncTLS delivers full-attention accuracy with 1.2-10x operator speedups and 1.3-4.7x end-to-end throughput gains on 48k-96k contexts via two-level sparse attention and asynchronous offloading.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
StarCoder 2 and The Stack v2: The Next Generation
cs.SE 2024-02 accept novelty 6.0

StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
Evaluating LLM-Generated Code: A Benchmark and Developer Study
cs.SE 2026-05 unverdicted novelty 5.0

A custom three-fold methodology combining a complex-project correctness benchmark, code quality verification, and structured developer reviews to evaluate LLM-generated code beyond correctness alone.
Can LLMs be Effective Code Contributors? A Study on Open-source Projects
cs.SE 2026-04 unverdicted novelty 5.0

LLMs achieve only 0-60% success when asked to contribute code to sizable open-source projects, often failing basic checks or simply repeating training data.
A Survey on Large Language Models for Code Generation
cs.CL 2024-06 unverdicted novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

Appendix A: Ablation Study for Prompt Construction

In this appendix, we present a pilot study that focuses on constru...

In-File Context (IFC): In-file contexts are the preceding n lines before the line we want to predict. We investigate two cases: Short IFC (IFC-Short), which crops a maximum of the preceding 30 lines, and Long IFC (IFC-Long), which crops a maximum of the preceding 120 lines.

Import Statements (IS): To avoid losing the import statements due to cropping, we construct the prompt by concatenating all the IS before the IFC.

Cross-File Context (XFC): Cross-file contexts are commented code snippets from other files parsed from import statements. We attach them at the beginning of the prompt.

Table 5: Ablation study comparing different combinations of Cross-File Context (XFC), Import Statements (IS), and In-File Context (IFC) with both short (IFC-Short) and long (IFC-Long) vari...

The integration of both ISC and XFC shows the best overall results and significantly enhances cross-file code completion performance, even though there may be duplicated information between XFC and ISC.

Including ISC and XFC improves not only cross-file completion but also in-file completion. This improved performance is observed even when the included snippets do not specifically target in-file completion.

For XF-R settings, where the module that the next line will use is possibly used in the IFC, which may provide hints for next-line prediction, the inclusion of a longer in-file context (IFC-Long) appears to be beneficial for Python. This suggests that extended context within the same file can potentially help prediction if it is not the first usage. Howeve...
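
The prompt layout described in the appendix excerpt (commented cross-file snippets first, then all import statements, then the cropped in-file context) can be sketched as below. This is an illustration under assumptions, not the authors' code: the function name `build_prompt` is hypothetical, import lines are detected with a simple prefix check, `#` is used as the comment marker for Python, and imports that also fall inside the cropped window are not deduplicated.

```python
from __future__ import annotations


def build_prompt(
    xfc_snippets: list[str],
    preceding_lines: list[str],
    max_ifc_lines: int = 30,
) -> str:
    """Assemble XFC + IS + IFC into a single prompt string."""
    # XFC: comment out every line of each cross-file snippet.
    xfc = ["# " + line for snippet in xfc_snippets for line in snippet.splitlines()]
    # IS: collect all import lines so that cropping cannot drop them.
    imports = [
        line for line in preceding_lines if line.startswith(("import ", "from "))
    ]
    # IFC: keep at most the last `max_ifc_lines` lines before the target line.
    ifc = preceding_lines[-max_ifc_lines:] if max_ifc_lines > 0 else []
    return "\n".join(xfc + imports + ifc)


# Hypothetical file: two imports followed by 40 assignment lines.
preceding = ["import os", "from utils import add"] + [
    f"x{i} = {i}" for i in range(40)
]
prompt = build_prompt(["def add(a, b):\n    return a + b"], preceding, max_ifc_lines=30)
```

Passing `max_ifc_lines=120` would correspond to the IFC-Long setting; in this example the two import lines survive even though they fall outside the 30-line window.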