RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
Pith reviewed 2026-05-15 22:26 UTC · model grok-4.3
The pith
RepoBench introduces a benchmark for repository-level code auto-completion with three tasks covering retrieval, next-line prediction, and combined pipelines in Python and Java.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RepoBench is a benchmark for evaluating repository-level code auto-completion systems that supports Python and Java and consists of three tasks: RepoBench-R measures retrieval of the most relevant code snippets from other files, RepoBench-C measures next-line prediction using both cross-file and in-file context, and RepoBench-P measures complex tasks that require combining retrieval and next-line prediction.
What carries the argument
The three interconnected tasks RepoBench-R (retrieval), RepoBench-C (code completion), and RepoBench-P (pipeline) that measure use of cross-file context in realistic repository settings.
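To make the retrieval task concrete, the following is a minimal sketch of ranking candidate cross-file snippets against the code being completed. It assumes a token-level Jaccard similarity as the scorer; this page does not say which retrievers the paper evaluates, and `tokenize`, `jaccard`, and `rank_snippets` are hypothetical names for illustration, not part of RepoBench.

```python
import re


def tokenize(code: str) -> set[str]:
    """Split code into a set of identifier-like and numeric tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_snippets(query: str, candidates: list[str]) -> list[int]:
    """Return candidate indices sorted from most to least similar to the query."""
    q = tokenize(query)
    scores = [jaccard(q, tokenize(c)) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy example with made-up snippets: the query calls `load_config`.
query = "cfg = load_config(path)\nmodel = build_model(cfg)"
candidates = [
    "def save_report(rows):\n    return len(rows)",
    "def load_config(path):\n    return parse(path)",
]
ranking = rank_snippets(query, candidates)
```

A retrieval task in this style would then check whether the top-ranked index matches the snippet the next line actually depends on. A lexical scorer is only one possible choice; an embedding-based scorer could be dropped in behind the same `rank_snippets` interface.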
Load-bearing premise
The three constructed tasks and data selection in RepoBench faithfully capture the challenges of real repository-level code completion without selection biases or artificial simplifications.
What would settle it
A finding that advanced models perform about as well on RepoBench as on single-file benchmarks, or that the tasks can be solved without genuine cross-file reasoning, would undermine the claim that RepoBench measures new repository-level capabilities.
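One way to probe this is to score the same model with and without cross-file context and look at the gap. The sketch below assumes exact match on the predicted line as the metric, which this page does not confirm as the paper's choice; the function names and the four example lines are made up for illustration.

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to the reference line after stripping whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def cross_file_gain(
    with_xfc: list[str], without_xfc: list[str], references: list[str]
) -> float:
    """Exact-match improvement attributable to adding cross-file context."""
    return exact_match(with_xfc, references) - exact_match(without_xfc, references)


# Made-up predictions for four target lines, for illustration only.
references = ["x = foo(a)", "return y", "cfg.load()", "i += 1"]
with_xfc = ["x = foo(a)", "return y", "cfg.load()", "i -= 1"]
without_xfc = ["x = bar(a)", "return y", "cfg.read()", "i -= 1"]
gain = cross_file_gain(with_xfc, without_xfc, references)
```

A gain close to zero on real data would be the kind of evidence described above, since it would suggest the target lines can be predicted from in-file context alone.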
Original abstract
Large Language Models (LLMs) have greatly advanced code auto-completion systems, with a potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench supports both Python and Java and consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and to encourage continuous improvement in auto-completion systems. RepoBench is publicly available at https://github.com/Leolty/repobench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that existing code auto-completion benchmarks are limited to single-file tasks and introduces RepoBench to address this gap. RepoBench is a new benchmark for repository-level code auto-completion supporting Python and Java, with three tasks: RepoBench-R (retrieval of relevant cross-file code snippets), RepoBench-C (next-line code prediction using cross-file and in-file context), and RepoBench-P (combined retrieval and prediction pipeline). The benchmark is made publicly available to enable better evaluation and improvement of auto-completion systems.
Significance. If the tasks and data construction accurately reflect real-world multi-file dependencies without substantial selection biases, RepoBench would provide a valuable standardized benchmark for evaluating repository-level code completion, helping to drive progress in LLM-based systems that handle cross-file context in practical developer scenarios.
major comments (1)
- [§3] The repo filtering (stars, language, size) and context extraction methods (import-based retrieval for RepoBench-R, line prediction for RepoBench-C) are described without any quantitative validation, statistics, or analysis showing that the chosen cross-file contexts are non-trivial, representative, or free of simplifications relative to real repositories. This directly affects the central claim that the three tasks close the assessment gap for complex, real-world scenarios.
minor comments (2)
- The abstract and introduction could more explicitly state the total number of repositories, files, and examples per task and language to convey the benchmark's scale.
- Figure or table captions should clarify how the three tasks interconnect in the pipeline evaluation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the validation of our data construction methods.
Point-by-point responses
- Referee: [§3] The repo filtering (stars, language, size) and context extraction methods (import-based retrieval for RepoBench-R, line prediction for RepoBench-C) are described without any quantitative validation, statistics, or analysis showing that the chosen cross-file contexts are non-trivial, representative, or free of simplifications relative to real repositories. This directly affects the central claim that the three tasks close the assessment gap for complex, real-world scenarios.
  Authors: We agree that the current description in §3 would benefit from quantitative validation to better support the claim of real-world relevance. In the revised manuscript, we will add a new subsection with statistics including: the distribution of repository sizes and star counts after filtering; the average number of cross-file imports per file in the sampled data; the distribution of context snippet lengths and dependency depths; and qualitative examples of multi-file interactions. These additions will demonstrate that the contexts are non-trivial and representative, directly addressing the concern about simplifications.
  Revision: yes
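One of the statistics promised in the rebuttal, the average number of cross-file imports per file, could be computed along the following lines for Python repositories. This is a rough sketch under stated assumptions, not the authors' procedure: the repository is given as a mapping from relative path to source text, only absolute imports are resolved (relative imports and `from pkg import module` forms are ignored), and all function names are hypothetical.

```python
import ast
from pathlib import PurePosixPath


def module_name(path: str) -> str:
    """Map 'pkg/utils.py' to 'pkg.utils' (and 'pkg/__init__.py' to 'pkg')."""
    parts = list(PurePosixPath(path).with_suffix("").parts)
    if parts and parts[-1] == "__init__":
        parts.pop()
    return ".".join(parts)


def imported_modules(source: str) -> set[str]:
    """Collect absolute module names named in import statements."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module)
    return found


def cross_file_import_counts(repo: dict[str, str]) -> dict[str, int]:
    """Per file, count imported modules that are defined inside the repository."""
    local = {module_name(p) for p in repo}
    return {
        path: len((imported_modules(src) & local) - {module_name(path)})
        for path, src in repo.items()
    }


# Made-up two-file repository for illustration.
repo = {
    "pkg/utils.py": "import os\n\ndef helper():\n    return os.getcwd()\n",
    "pkg/main.py": "import json\nfrom pkg.utils import helper\n\nprint(helper())\n",
}
counts = cross_file_import_counts(repo)
average = sum(counts.values()) / len(counts)
```

Standard-library and third-party imports such as `os` and `json` are excluded because they do not match any module path inside the repository, so the count reflects only dependencies on sibling files.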
Circularity Check
No circularity: benchmark tasks are explicitly defined from scratch
full rationale
The paper introduces RepoBench as a new benchmark with three explicitly constructed tasks (RepoBench-R for retrieval, RepoBench-C for next-line prediction with cross-file context, and RepoBench-P for combined pipeline). These are defined via repo filtering criteria, context extraction rules, and evaluation metrics that do not reduce to any fitted parameters, prior predictions, or self-citation chains. No equations or derivations are present that equate outputs to inputs by construction. The central claim (that the benchmark fills an assessment gap) rests on the novelty of the task definitions themselves, which are independent of any fitted results or uniqueness theorems from the authors' prior work.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Single-file benchmarks leave an assessment gap for multi-file programming scenarios
- domain assumption: LLMs have advanced code auto-completion systems
invented entities (1)
- RepoBench-R, RepoBench-C, RepoBench-P tasks (no independent evidence)
Forward citations
Cited by 20 Pith papers
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
-
InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
-
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
-
When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context
Stale repository context in code RAG actively induces models to produce obsolete helper references, raising stale outputs by 76-88 percentage points over current-only retrieval in a 17-sample diagnostic study.
-
Toward Executable Repository-Level Code Generation via Environment Alignment
EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative alignment.
-
ABTest: Behavior-Driven Testing for AI Coding Agents
ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.
-
Story Point Estimation Using Large Language Models
LLMs predict story points better in zero-shot prompting than supervised deep learning models trained on 80% of project data, with few-shot examples and comparative judgments further improving performance.
-
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
-
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
-
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...
-
When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
-
KV Cache Offloading for Context-Intensive Tasks
KV offloading hurts accuracy on context-heavy tasks because of low-rank key projections and bad landmarks, but a simpler strategy improves results across models and benchmarks.
-
KV Cache Offloading for Context-Intensive Tasks
KV offloading hurts accuracy on context-heavy tasks due to low-rank key projections and bad landmarks, but a simpler strategy recovers performance across models.
-
KV Cache Offloading for Context-Intensive Tasks
KV offloading degrades performance on context-intensive tasks due to low-rank key projections and unreliable landmarks, but a simpler alternative strategy restores accuracy across LLM families.
-
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
AsyncTLS delivers full-attention accuracy with 1.2-10x operator speedups and 1.3-4.7x end-to-end throughput gains on 48k-96k contexts via two-level sparse attention and asynchronous offloading.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
StarCoder 2 and The Stack v2: The Next Generation
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
-
Evaluating LLM-Generated Code: A Benchmark and Developer Study
A custom three-fold methodology combining a complex-project correctness benchmark, code quality verification, and structured developer reviews to evaluate LLM-generated code beyond correctness alone.
-
Can LLMs be Effective Code Contributors? A Study on Open-source Projects
LLMs achieve only 0-60% success when asked to contribute code to sizable open-source projects, often failing basic checks or simply repeating training data.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
Appendix A. Ablation Study for Prompt Construction
In this appendix, we present a pilot study that focuses on constru...
- In-File Context (IFC): In-file contexts are the preceding n lines before the line we want to predict. We investigate two cases: Short IFC (IFC-Short), which crops a maximum of the preceding 30 lines, and Long IFC (IFC-Long), which crops a maximum of the preceding 120 lines.
- Import Statements (IS): To avoid losing the import statements due to cropping, we construct the prompt by concatenating all the IS before the IFC.
- Cross-File Context (XFC): Cross-file contexts are commented code snippets from other files parsed from import statements. We attach them at the beginning of the prompt.
Table 5: Ablation study comparing different combinations of Cross-File Context (XFC), Import Statements (IS), and In-File Context (IFC) with both short (IFC-Short) and long (IFC-Long) vari...
- The integration of both IS and XFC shows the best overall results and significantly enhances cross-file code completion performance, even though there may be duplicated information between XFC and IS.
- Including IS and XFC improves not only cross-file completion but also in-file completion. This improved performance is observed even when the included snippets do not specifically target in-file completion.
- For XF-R settings, where the module that the next line will use may already be used in the IFC, which can provide hints for next-line prediction, the inclusion of a longer in-file context (IFC-Long) appears to be beneficial for Python. This suggests that extended context within the same file can potentially help prediction if it is not the first usage. Howeve...
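The prompt layout described in the appendix (commented cross-file snippets first, then import statements, then the cropped in-file context) can be sketched as below. This is an illustrative reading rather than the authors' implementation: `comment_out` and `build_prompt` are hypothetical names, only top-level `import` and `from` lines are treated as import statements, and, as an assumption made here to avoid duplicates, only imports that fall outside the cropped window are re-attached.

```python
def comment_out(snippet: str) -> str:
    """Prefix every line with '# ' so the snippet reads as a Python comment."""
    return "\n".join("# " + line for line in snippet.splitlines())


def build_prompt(
    preceding_lines: list[str],
    cross_file_snippets: list[str],
    max_ifc_lines: int = 30,
) -> str:
    """Assemble XFC (commented), then import statements, then cropped in-file context."""
    # In-file context: at most the last `max_ifc_lines` lines before the target line.
    ifc = preceding_lines[-max_ifc_lines:] if max_ifc_lines > 0 else []
    # Lines removed by cropping; import statements among them are re-attached.
    cropped = preceding_lines[: len(preceding_lines) - len(ifc)]
    imports = [line for line in cropped if line.startswith(("import ", "from "))]
    xfc = [comment_out(s) for s in cross_file_snippets]
    return "\n".join(xfc + imports + ifc)


# Made-up file contents; the target line would follow `path = os.getcwd()`.
preceding = [
    "import os",
    "from pkg.utils import helper",
    "",
    "def main():",
    "    path = os.getcwd()",
]
prompt = build_prompt(preceding, ["def helper():\n    return 1"], max_ifc_lines=2)
```

Setting `max_ifc_lines=30` or `max_ifc_lines=120` would correspond to the IFC-Short and IFC-Long settings named in the appendix; the value 2 is used here only to make the cropping visible on a five-line file.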