RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Tianyang Liu, Canwen Xu, Julian McAuley · 2023 · cs.CL · arXiv 2306.03091

Canonical reference. 100% of citing Pith papers cite this work as background.

47 Pith papers citing it

Background 100% of classified citations

abstract

Large Language Models (LLMs) have greatly advanced code auto-completion systems, with a potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench supports both Python and Java and consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and encourage continuous improvement in auto-completion systems. RepoBench is publicly available at https://github.com/Leolty/repobench.
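
The retrieve-then-complete flow the abstract describes can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which retriever or metric RepoBench uses, so this example swaps in a token-level Jaccard similarity for retrieval and a whitespace-insensitive exact match for scoring, and the function names and toy snippets are hypothetical.

```python
import re


def tokens(code: str) -> set[str]:
    """Split code into a set of identifier-like tokens (illustrative tokenizer)."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; assumed here as a stand-in retriever score."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, candidates: list[str], k: int = 1) -> list[str]:
    """Retrieval step: rank cross-file snippets by similarity to the in-file context."""
    ranked = sorted(candidates, key=lambda c: jaccard(query, c), reverse=True)
    return ranked[:k]


def exact_match(predicted: str, reference: str) -> bool:
    """Completion step metric (assumed): next-line prediction matches after trimming."""
    return predicted.strip() == reference.strip()


# Hypothetical toy data: one line of in-file context and two cross-file snippets.
in_file_context = "total = compute_total(cart_items)"
cross_file_snippets = [
    "def compute_total(cart_items):\n    return sum(i.price for i in cart_items)",
    "def send_email(address, body):\n    smtp.send(address, body)",
]

# Pipeline: retrieve first, then hand the retrieved snippet plus the in-file
# context to a completion model (not shown) and score its next-line prediction.
top_snippets = retrieve(in_file_context, cross_file_snippets, k=1)
```

In this sketch the retrieved snippets would be prepended to the prompt of whatever completion model is under test, and `exact_match` would compare that model's predicted next line against the ground-truth line.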

citation-role summary

background 6 · dataset 2 · method 1

citation-polarity summary

background 9

representative citing papers

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

cs.CL · 2026-04-14 · unverdicted · novelty 8.0

InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository

cs.SE · 2026-01-20 · accept · novelty 8.0

RepoGenesis benchmark shows top AI systems reach only 23.67% Pass@1 on full microservice repository generation despite up to 73.91% API coverage and 100% deployment success.

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

cs.CL · 2023-08-28 · unverdicted · novelty 8.0

LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).

Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs

cs.AI · 2026-06-07 · unverdicted · novelty 7.0

Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.

SmellBench: Towards Fine-Grained Evaluation of Code Agents on Refactoring Tasks

cs.SE · 2026-06-04 · unverdicted · novelty 7.0

SmellBench creates 294 controlled refactoring cases across 7 smell types from 7 repositories and finds the strongest agent-LLM pair reaches only 50.34 on smell elimination due to local focus and weak cross-file reasoning.

TeleSWEBench: A Commit-Driven Benchmark for Evaluating LLM-Powered Software Engineering in Telecommunications

cs.SE · 2026-06-03 · unverdicted · novelty 7.0

Presents TeleSWEBench, the first commit-driven benchmark with 734 unit-test cases from srsRAN 5G plus TeleJudge LLM evaluator, showing top ASE tools achieve up to 25% functional success on telecom tasks.

BIRDS: Characterizing and Understanding Biodiversity Impact of Large Language Model Serving

q-bio.OT · 2026-05-26 · unverdicted · novelty 7.0

BIRDS framework quantifies request-level biodiversity impacts of LLM serving via operational and embodied pathways and introduces QNBI to jointly assess impact and quality, showing accumulation at scale across workloads, models, GPUs, and regions.

When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context

cs.SE · 2026-05-14 · accept · novelty 7.0

Stale repository context in code RAG actively induces models to produce obsolete helper references, raising stale outputs by 76-88 percentage points over current-only retrieval in a 17-sample diagnostic study.

KV Cache Offloading for Context-Intensive Tasks

cs.LG · 2026-04-09 · conditional · novelty 7.0 · 4 refs

KV offloading degrades accuracy on context-intensive tasks due to low-rank key projections and unreliable landmarks; a simpler alternative improves results across models and benchmarks.

Toward Executable Repository-Level Code Generation via Environment Alignment

cs.SE · 2026-04-04 · unverdicted · novelty 7.0

EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative alignment.

ABTest: Behavior-Driven Testing for AI Coding Agents

cs.SE · 2026-04-03 · unverdicted · novelty 7.0

ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.

Story Point Estimation Using Large Language Models

cs.SE · 2026-03-06 · unverdicted · novelty 7.0

LLMs predict story points better in zero-shot prompting than supervised deep learning models trained on 80% of project data, with few-shot examples and comparative judgments further improving performance.

PerfCoder: Large Language Models for Interpretable Code Performance Optimization

cs.SE · 2025-12-16 · unverdicted · novelty 7.0

PerfCoder is a family of LLMs trained on optimization trajectories with human annotations and runtime-based preference alignment that achieves higher runtime speedups and optimization rates on the PIE benchmark than prior models while producing interpretable feedback.

Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software

cs.SE · 2025-10-17 · unverdicted · novelty 7.0

LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

cs.SE · 2025-04-03 · unverdicted · novelty 7.0

Multi-SWE-bench provides 1,632 high-quality issue-resolving instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++ for evaluating LLMs on codebase modifications.

PACE: A Proxy for Agentic Capability Evaluation

cs.AI · 2026-07-02 · unverdicted · novelty 6.0

PACE builds proxy benchmarks from non-agentic instances via relevance and global selection plus regression to predict agentic scores with MAE under 4%, Spearman correlation above 0.80, and 85% ranking accuracy at under 1% cost.

SWE-Router: Routing in Multi-turn Agentic Software Engineering Tasks

cs.SE · 2026-06-30 · unverdicted · novelty 6.0

SWE-Router introduces trajectory-conditioned value-based routing for LLM agents on SWE tasks, with a Bayes-optimality theorem and empirical cost savings while retaining most strong-model performance.

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

PROPEL amortizes solver evaluation with a trained activation probe to optimize task generators toward a target solve rate, raising the share of learnable tasks from ~10% to ~20% in coding and SWE experiments.

End-to-End Context Compression at Scale

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

LCLMs are scaled 0.6B-encoder 4B-decoder compressors pre-trained on over 350B tokens that improve the Pareto frontier for general-task performance, compression speed, and peak memory in long-context language model inference.

Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

cs.SE · 2026-06-04 · unverdicted · novelty 6.0

Code2LoRA generates repo-specific LoRA adapters via hypernetwork for code LMs, matching per-repo LoRA on static tasks and exceeding shared LoRA by 5.2 pp on evolving code in a 604-repo benchmark.

Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix

cs.IR · 2026-05-18 · unverdicted · novelty 6.0

A q-log odds variant of BM25 raises NDCG@10 by 89% relative on CodeSearchNet Go under fixed generic tokenization while recovering standard BM25 at q=1.

Contextualized Code Pretraining for Code Generation

cs.SE · 2026-05-18 · unverdicted · novelty 6.0

Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

cs.AR · 2026-05-17 · unverdicted · novelty 6.0

VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.

citing papers explorer

A Survey on Large Language Models for Code Generation

cs.CL · 2024-06-01 · unverdicted · none · ref 171 · internal anchor

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
