pith. machine review for the scientific record.

arxiv: 2306.03091 · v2 · submitted 2023-06-05 · 💻 cs.CL · cs.AI · cs.SE

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Pith reviewed 2026-05-15 22:26 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SE
keywords code auto-completion · repository-level · benchmark · large language models · Python · Java · retrieval · evaluation

The pith

RepoBench introduces a benchmark for repository-level code auto-completion with three tasks covering retrieval, next-line prediction, and combined pipelines in Python and Java.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current benchmarks for code auto-completion focus on single-file tasks and leave a gap for realistic multi-file scenarios common in development. The paper introduces RepoBench to close this gap through three interconnected tasks: one for retrieving relevant code snippets from other files, one for predicting the next line with cross-file and in-file context, and one for pipelines that combine retrieval and prediction. This structure tests systems on complex repository contexts rather than isolated files. A sympathetic reader would care because improved evaluation could drive more capable auto-completion tools that handle dependencies across large codebases.

Core claim

RepoBench is a benchmark for evaluating repository-level code auto-completion systems that supports Python and Java and consists of three tasks: RepoBench-R measures retrieval of the most relevant code snippets from other files, RepoBench-C measures next-line prediction using both cross-file and in-file context, and RepoBench-P measures complex tasks that require combining retrieval and next-line prediction.

What carries the argument

The three interconnected tasks RepoBench-R (retrieval), RepoBench-C (code completion), and RepoBench-P (pipeline) that measure use of cross-file context in realistic repository settings.
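
How the three tasks fit together can be illustrated with a minimal sketch. It assumes a Jaccard token-overlap retriever, a caller-supplied `complete` function standing in for a language model, and exact match as the score; none of these choices is specified in the summary above, and all function names are hypothetical.

```python
import re


def tokens(code):
    """Split code into a set of identifier-like tokens."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a, b):
    """Token-set overlap between two code strings (assumed similarity measure)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(in_file_context, candidates, k=1):
    """Retrieval step: rank snippets from other files against the in-file context."""
    ranked = sorted(candidates, key=lambda c: jaccard(in_file_context, c), reverse=True)
    return ranked[:k]


def exact_match(prediction, gold_line):
    """Completion score: does the predicted next line equal the gold line?"""
    return prediction.strip() == gold_line.strip()


def pipeline(in_file_context, candidates, complete, k=1):
    """Pipeline step: retrieve cross-file context, then predict the next line.

    `complete` is a placeholder callable (cross_file_snippets, in_file_context) -> str.
    """
    cross_file = retrieve(in_file_context, candidates, k)
    return complete(cross_file, in_file_context)
```

Scoring retrieval alone, completion alone with gold context, and the two chained together corresponds to the split between the R, C, and P tasks.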

Load-bearing premise

The three constructed tasks and data selection in RepoBench faithfully capture the challenges of real repository-level code completion without selection biases or artificial simplifications.

What would settle it

A finding that advanced models achieve similar performance on RepoBench as on single-file benchmarks, or that the tasks can be solved without genuine cross-file reasoning, would undermine the claim that RepoBench measures new repository-level capabilities.

Original abstract

Large Language Models (LLMs) have greatly advanced code auto-completion systems, with a potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench supports both Python and Java and consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and encourage continuous improvement in auto-completion systems. RepoBench is publicly available at https://github.com/Leolty/repobench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that existing code auto-completion benchmarks are limited to single-file tasks and introduces RepoBench to address this gap. RepoBench is a new benchmark for repository-level code auto-completion supporting Python and Java, with three tasks: RepoBench-R (retrieval of relevant cross-file code snippets), RepoBench-C (next-line code prediction using cross-file and in-file context), and RepoBench-P (combined retrieval and prediction pipeline). The benchmark is made publicly available to enable better evaluation and improvement of auto-completion systems.

Significance. If the tasks and data construction accurately reflect real-world multi-file dependencies without substantial selection biases, RepoBench would provide a valuable standardized benchmark for evaluating repository-level code completion, helping to drive progress in LLM-based systems that handle cross-file context in practical developer scenarios.

major comments (1)
  1. [§3] The repo filtering (stars, language, size) and context extraction methods (import-based retrieval for RepoBench-R, line prediction for RepoBench-C) are described without any quantitative validation, statistics, or analysis showing that the chosen cross-file contexts are non-trivial, representative, or free of simplifications relative to real repositories. This directly affects the central claim that the three tasks close the assessment gap for complex, real-world scenarios.
minor comments (2)
  1. The abstract and introduction could more explicitly state the total number of repositories, files, and examples per task and language to convey the benchmark's scale.
  2. Figure or table captions should clarify how the three tasks interconnect in the pipeline evaluation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the validation of our data construction methods.

Point-by-point responses
  1. Referee: [§3] The repo filtering (stars, language, size) and context extraction methods (import-based retrieval for RepoBench-R, line prediction for RepoBench-C) are described without any quantitative validation, statistics, or analysis showing that the chosen cross-file contexts are non-trivial, representative, or free of simplifications relative to real repositories. This directly affects the central claim that the three tasks close the assessment gap for complex, real-world scenarios.

    Authors: We agree that the current description in §3 would benefit from quantitative validation to better support the claim of real-world relevance. In the revised manuscript, we will add a new subsection with statistics including: the distribution of repository sizes and star counts after filtering; the average number of cross-file imports per file in the sampled data; the distribution of context snippet lengths and dependency depths; and qualitative examples of multi-file interactions. These additions will demonstrate that the contexts are non-trivial and representative, directly addressing the concern about simplifications. revision: yes
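
One of the promised statistics, the average number of cross-file imports per file, could be computed along these lines for Python. This is a sketch, not the authors' procedure: it assumes `repo_modules` is a precomputed set of dotted module names defined inside the repository, treats every relative import as cross-file, and uses hypothetical function names.

```python
import ast


def cross_file_imports(source, repo_modules):
    """Return the set of imported modules that live inside the repository."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b` : keep only names that resolve to repository modules.
            found.update(a.name for a in node.names if a.name in repo_modules)
        elif isinstance(node, ast.ImportFrom):
            if node.level > 0:
                # Relative import: assumed to point at another file in the repo.
                found.add("." * node.level + (node.module or ""))
            elif node.module in repo_modules:
                found.add(node.module)
    return found


def mean_cross_file_imports(files, repo_modules):
    """Average count of distinct cross-file imports over a {path: source} mapping."""
    counts = [len(cross_file_imports(src, repo_modules)) for src in files.values()]
    return sum(counts) / len(counts) if counts else 0.0
```

A value close to zero on the sampled data would support the referee's concern that the cross-file contexts are trivial; a distribution with a long tail would argue the opposite.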

Circularity Check

0 steps flagged

No circularity: benchmark tasks are explicitly defined from scratch

Full rationale

The paper introduces RepoBench as a new benchmark with three explicitly constructed tasks (RepoBench-R for retrieval, RepoBench-C for next-line prediction with cross-file context, and RepoBench-P for combined pipeline). These are defined via repo filtering criteria, context extraction rules, and evaluation metrics that do not reduce to any fitted parameters, prior predictions, or self-citation chains. No equations or derivations are present that equate outputs to inputs by construction. The central claim (that the benchmark fills an assessment gap) rests on the novelty of the task definitions themselves, which are independent of any fitted results or uniqueness theorems from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The contribution rests on standard assumptions about the limitations of single-file benchmarks and the value of new evaluation tasks; no free parameters or invented physical entities are introduced.

axioms (2)
  • domain assumption: Single-file benchmarks leave an assessment gap for multi-file programming scenarios.
    Core motivation stated in the abstract.
  • domain assumption: LLMs have advanced code auto-completion systems.
    Background claim in the abstract.
invented entities (1)
  • RepoBench-R, RepoBench-C, RepoBench-P tasks (no independent evidence)
    purpose: To separately measure retrieval, code completion, and combined pipeline abilities at repository level.
    Newly defined evaluation tasks introduced by the paper.

pith-pipeline@v0.9.0 · 5476 in / 1212 out tokens · 20377 ms · 2026-05-15T22:26:46.801004+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  2. InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

    cs.CL 2026-04 unverdicted novelty 8.0

    InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

  3. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    cs.CL 2023-08 unverdicted novelty 8.0

    LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).

  4. When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context

    cs.SE 2026-05 accept novelty 7.0

    Stale repository context in code RAG actively induces models to produce obsolete helper references, raising stale outputs by 76-88 percentage points over current-only retrieval in a 17-sample diagnostic study.

  5. Toward Executable Repository-Level Code Generation via Environment Alignment

    cs.SE 2026-04 unverdicted novelty 7.0

    EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative alignment.

  6. ABTest: Behavior-Driven Testing for AI Coding Agents

    cs.SE 2026-04 unverdicted novelty 7.0

    ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.

  7. Story Point Estimation Using Large Language Models

    cs.SE 2026-03 unverdicted novelty 7.0

    LLMs predict story points better in zero-shot prompting than supervised deep learning models trained on 80% of project data, with few-shot examples and comparative judgments further improving performance.

  8. Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

    cs.CL 2026-05 unverdicted novelty 6.0

    LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

  9. Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

    cs.SE 2026-04 unverdicted novelty 6.0

    Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

  10. SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference

    cs.NI 2026-04 unverdicted novelty 6.0

    SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...

  11. When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation

    cs.SE 2026-04 unverdicted novelty 6.0

    LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.

  12. KV Cache Offloading for Context-Intensive Tasks

    cs.LG 2026-04 unverdicted novelty 6.0

    KV offloading hurts accuracy on context-heavy tasks because of low-rank key projections and bad landmarks, but a simpler strategy improves results across models and benchmarks.

  13. KV Cache Offloading for Context-Intensive Tasks

    cs.LG 2026-04 unverdicted novelty 6.0

    KV offloading hurts accuracy on context-heavy tasks due to low-rank key projections and bad landmarks, but a simpler strategy recovers performance across models.

  14. KV Cache Offloading for Context-Intensive Tasks

    cs.LG 2026-04 unverdicted novelty 6.0

    KV offloading degrades performance on context-intensive tasks due to low-rank key projections and unreliable landmarks, but a simpler alternative strategy restores accuracy across LLM families.

  15. AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

    cs.CL 2026-04 unverdicted novelty 6.0

    AsyncTLS delivers full-attention accuracy with 1.2-10x operator speedups and 1.3-4.7x end-to-end throughput gains on 48k-96k contexts via two-level sparse attention and asynchronous offloading.

  16. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  17. StarCoder 2 and The Stack v2: The Next Generation

    cs.SE 2024-02 accept novelty 6.0

    StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

  18. Evaluating LLM-Generated Code: A Benchmark and Developer Study

    cs.SE 2026-05 unverdicted novelty 5.0

    A custom three-fold methodology combining a complex-project correctness benchmark, code quality verification, and structured developer reviews to evaluate LLM-generated code beyond correctness alone.

  19. Can LLMs be Effective Code Contributors? A Study on Open-source Projects

    cs.SE 2026-04 unverdicted novelty 5.0

    LLMs achieve only 0-60% success when asked to contribute code to sizable open-source projects, often failing basic checks or simply repeating training data.

  20. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

Appendix A: Ablation Study for Prompt Construction

In this appendix, we present a pilot study that focuses on constru...

  1. In-File Context (IFC): In-file contexts are the preceding n lines before the line we want to predict. We investigate two cases: Short IFC (IFC-Short), which crops a maximum of the preceding 30 lines, and Long IFC (IFC-Long), which crops a maximum of the preceding 120 lines.
  2. Import Statements (IS): To avoid losing the import statements due to cropping, we construct the prompt by concatenating all the IS before the IFC.
  3. Cross-File Context (XFC): Cross-file contexts are commented code snippets from other files parsed from import statements. We attach them at the beginning of the prompt.

Table 5: Ablation study comparing different combinations of Cross-File Context (XFC), Import Statements (IS), and In-File Context (IFC) with both short (IFC-Short) and long (IFC-Long) vari...

  1. The integration of both IS and XFC shows the best overall results and significantly enhances cross-file code completion performance, even though there may be duplicated information between XFC and IS.
  2. Including IS and XFC improves not only cross-file completion but also in-file completion. This improved performance is observed even when the included snippets do not specifically target in-file completion.
  3. For XF-R settings, where the module used by the next line may already appear in the IFC and so provide hints for next-line prediction, a longer in-file context (IFC-Long) appears to be beneficial for Python. This suggests that extended context within the same file can help prediction when the line is not the module's first usage. Howeve...
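
The prompt layout described in the appendix text above (commented XFC first, then IS, then the cropped IFC) can be sketched as follows. This is an illustration under assumptions rather than the paper's implementation: the function name is hypothetical, import lines are detected with a simple Python prefix check, and only imports that fall outside the cropped window are re-attached so that nothing is duplicated.

```python
def build_prompt(xfc_snippets, file_lines, target_idx, max_ifc_lines=30):
    """Assemble a prompt for predicting file_lines[target_idx].

    xfc_snippets  : code snippets taken from other files (cross-file context)
    file_lines    : lines of the current file
    max_ifc_lines : 30 mirrors IFC-Short, 120 mirrors IFC-Long
    """
    preceding = file_lines[:target_idx]

    # Crop the in-file context to the last `max_ifc_lines` preceding lines.
    cut = max(0, len(preceding) - max_ifc_lines)
    cropped_away, ifc = preceding[:cut], preceding[cut:]

    # Recover import statements that the crop would otherwise drop.
    imports = [
        line for line in cropped_away
        if line.lstrip().startswith(("import ", "from "))
    ]

    # Cross-file snippets are attached as comments at the top of the prompt.
    xfc = []
    for snippet in xfc_snippets:
        xfc.extend("# " + line for line in snippet.splitlines())

    return "\n".join(xfc + imports + ifc)
```

Passing an empty `xfc_snippets` list gives an IS + IFC prompt, which is one way to reproduce the kind of combinations the ablation compares.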