hub Canonical reference

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Tianyang Liu, Canwen Xu, Julian McAuley · 2023 · cs.CL · arXiv 2306.03091

Canonical reference. 100% of citing Pith papers cite this work as background.

35 Pith papers citing it

Background 100% of classified citations

open full Pith review browse 35 citing papers arXiv PDF

abstract

Large Language Models (LLMs) have greatly advanced code auto-completion systems, with a potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench supports both Python and Java and consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and encouraging continuous improvement in auto-completion systems. RepoBench is publicly available at https://github.com/Leolty/repobench.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 dataset 2 method 1

citation-polarity summary

background 9

representative citing papers

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

cs.CL · 2026-04-14 · unverdicted · novelty 8.0

InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository

cs.SE · 2026-01-20 · accept · novelty 8.0

RepoGenesis benchmark shows top AI systems reach only 23.67% Pass@1 on full microservice repository generation despite up to 73.91% API coverage and 100% deployment success.

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

cs.CL · 2023-08-28 · unverdicted · novelty 8.0

LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).

BIRDS: Characterizing and Understanding Biodiversity Impact of Large Language Model Serving

q-bio.OT · 2026-05-26 · unverdicted · novelty 7.0

BIRDS framework quantifies request-level biodiversity impacts of LLM serving via operational and embodied pathways and introduces QNBI to jointly assess impact and quality, showing accumulation at scale across workloads, models, GPUs, and regions.

When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context

cs.SE · 2026-05-14 · accept · novelty 7.0

Stale repository context in code RAG actively induces models to produce obsolete helper references, raising stale outputs by 76-88 percentage points over current-only retrieval in a 17-sample diagnostic study.

KV Cache Offloading for Context-Intensive Tasks

cs.LG · 2026-04-09 · conditional · novelty 7.0 · 4 refs

KV offloading degrades accuracy on context-intensive tasks due to low-rank key projections and unreliable landmarks; a simpler alternative improves results across models and benchmarks.

Toward Executable Repository-Level Code Generation via Environment Alignment

cs.SE · 2026-04-04 · unverdicted · novelty 7.0

EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative alignment.

ABTest: Behavior-Driven Testing for AI Coding Agents

cs.SE · 2026-04-03 · unverdicted · novelty 7.0

ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.

Story Point Estimation Using Large Language Models

cs.SE · 2026-03-06 · unverdicted · novelty 7.0

LLMs predict story points better in zero-shot prompting than supervised deep learning models trained on 80% of project data, with few-shot examples and comparative judgments further improving performance.

PerfCoder: Large Language Models for Interpretable Code Performance Optimization

cs.SE · 2025-12-16 · unverdicted · novelty 7.0

PerfCoder is a family of LLMs trained on optimization trajectories with human annotations and runtime-based preference alignment that achieves higher runtime speedups and optimization rates on the PIE benchmark than prior models while producing interpretable feedback.

Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software

cs.SE · 2025-10-17 · unverdicted · novelty 7.0

LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

cs.SE · 2025-04-03 · unverdicted · novelty 7.0

Multi-SWE-bench provides 1,632 high-quality issue-resolving instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++ for evaluating LLMs on codebase modifications.

Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix

cs.IR · 2026-05-18 · unverdicted · novelty 6.0

A q-log odds variant of BM25 raises NDCG@10 by 89% relative on CodeSearchNet Go under fixed generic tokenization while recovering standard BM25 at q=1.

Contextualized Code Pretraining for Code Generation

cs.SE · 2026-05-18 · unverdicted · novelty 6.0

Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

cs.AR · 2026-05-17 · unverdicted · novelty 6.0

VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

cs.LG · 2026-05-14 · unverdicted · novelty 6.0 · 2 refs

DualKV eliminates redundant prompt replication in RL training attention kernels via fused dual-KV CUDA operations and token repacking, delivering 1.63-3.82x policy-update speedups while remaining mathematically equivalent to standard attention.

Evaluating LLM-Generated Code: A Benchmark and Developer Study

cs.SE · 2026-05-09 · unverdicted · novelty 6.0 · 2 refs

Introduces a three-fold benchmark for LLM-generated code combining correctness testing on a complex project, quality verification, and developer surveys to assess production readiness.

Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

cs.SE · 2026-04-30 · unverdicted · novelty 6.0

Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference

cs.NI · 2026-04-23 · unverdicted · novelty 6.0

SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting for runtime conditions.

When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation

cs.SE · 2026-04-10 · unverdicted · novelty 6.0

LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.

AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

cs.CL · 2026-04-09 · unverdicted · novelty 6.0

AsyncTLS delivers full-attention accuracy with 1.2-10x operator speedups and 1.3-4.7x end-to-end throughput gains on 48k-96k contexts via two-level sparse attention and asynchronous offloading.

CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation

cs.LG · 2026-02-09 · unverdicted · novelty 6.0

CompilerKV uses offline-compiled retention tables as portable priors to achieve SOTA prefill-only KV compression performance across backbones at low token budgets.

citing papers explorer

Showing 35 of 35 citing papers.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems? cs.AI · 2026-05-07 · unverdicted · none · ref 46 · internal anchor
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis cs.CL · 2026-04-14 · unverdicted · none · ref 17 · internal anchor
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository cs.SE · 2026-01-20 · accept · none · ref 3 · internal anchor
RepoGenesis benchmark shows top AI systems reach only 23.67% Pass@1 on full microservice repository generation despite up to 73.91% API coverage and 100% deployment success.
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding cs.CL · 2023-08-28 · unverdicted · none · ref 101 · internal anchor
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
BIRDS: Characterizing and Understanding Biodiversity Impact of Large Language Model Serving q-bio.OT · 2026-05-26 · unverdicted · none · ref 27 · internal anchor
BIRDS framework quantifies request-level biodiversity impacts of LLM serving via operational and embodied pathways and introduces QNBI to jointly assess impact and quality, showing accumulation at scale across workloads, models, GPUs, and regions.
When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context cs.SE · 2026-05-14 · accept · none · ref 5 · internal anchor
Stale repository context in code RAG actively induces models to produce obsolete helper references, raising stale outputs by 76-88 percentage points over current-only retrieval in a 17-sample diagnostic study.
KV Cache Offloading for Context-Intensive Tasks cs.LG · 2026-04-09 · conditional · none · ref 32 · 4 links · internal anchor
KV offloading degrades accuracy on context-intensive tasks due to low-rank key projections and unreliable landmarks; a simpler alternative improves results across models and benchmarks.
Toward Executable Repository-Level Code Generation via Environment Alignment cs.SE · 2026-04-04 · unverdicted · none · ref 19 · internal anchor
EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative alignment.
ABTest: Behavior-Driven Testing for AI Coding Agents cs.SE · 2026-04-03 · unverdicted · none · ref 13 · internal anchor
ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.
Story Point Estimation Using Large Language Models cs.SE · 2026-03-06 · unverdicted · none · ref 23 · internal anchor
LLMs predict story points better in zero-shot prompting than supervised deep learning models trained on 80% of project data, with few-shot examples and comparative judgments further improving performance.
PerfCoder: Large Language Models for Interpretable Code Performance Optimization cs.SE · 2025-12-16 · unverdicted · none · ref 22 · internal anchor
PerfCoder is a family of LLMs trained on optimization trajectories with human annotations and runtime-based preference alignment that achieves higher runtime speedups and optimization rates on the PIE benchmark than prior models while producing interpretable feedback.
Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software cs.SE · 2025-10-17 · unverdicted · none · ref 26 · internal anchor
LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving cs.SE · 2025-04-03 · unverdicted · none · ref 12 · internal anchor
Multi-SWE-bench provides 1,632 high-quality issue-resolving instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++ for evaluating LLMs on codebase modifications.
Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix cs.IR · 2026-05-18 · unverdicted · none · ref 10 · internal anchor
A q-log odds variant of BM25 raises NDCG@10 by 89% relative on CodeSearchNet Go under fixed generic tokenization while recovering standard BM25 at q=1.
Contextualized Code Pretraining for Code Generation cs.SE · 2026-05-18 · unverdicted · none · ref 31 · internal anchor
Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference cs.AR · 2026-05-17 · unverdicted · none · ref 37 · internal anchor
VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.
DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts cs.LG · 2026-05-14 · unverdicted · none · ref 16 · 2 links · internal anchor
DualKV eliminates redundant prompt replication in RL training attention kernels via fused dual-KV CUDA operations and token repacking, delivering 1.63-3.82x policy-update speedups while remaining mathematically equivalent to standard attention.
Evaluating LLM-Generated Code: A Benchmark and Developer Study cs.SE · 2026-05-09 · unverdicted · none · ref 10 · 2 links · internal anchor
Introduces a three-fold benchmark for LLM-generated code combining correctness testing on a complex project, quality verification, and developer surveys to assess production readiness.
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference cs.CL · 2026-05-08 · unverdicted · none · ref 31 · internal anchor
LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows cs.SE · 2026-04-30 · unverdicted · none · ref 22 · internal anchor
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference cs.NI · 2026-04-23 · unverdicted · none · ref 43 · internal anchor
SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting for runtime conditions.
When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation cs.SE · 2026-04-10 · unverdicted · none · ref 32 · internal anchor
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention cs.CL · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
AsyncTLS delivers full-attention accuracy with 1.2-10x operator speedups and 1.3-4.7x end-to-end throughput gains on 48k-96k contexts via two-level sparse attention and asynchronous offloading.
CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation cs.LG · 2026-02-09 · unverdicted · none · ref 13 · internal anchor
CompilerKV uses offline-compiled retention tables as portable priors to achieve SOTA prefill-only KV compression performance across backbones at low token budgets.
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks cs.CL · 2024-12-19 · accept · none · ref 4 · internal anchor
LongBench v2 benchmark shows current LLMs underperform humans on deep long-context reasoning tasks, but extended inference-time reasoning enables surpassing the human baseline.
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference cs.CL · 2024-07-16 · accept · none · ref 68 · internal anchor
Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on long-context benchmarks.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 278 · internal anchor
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
StarCoder 2 and The Stack v2: The Next Generation cs.SE · 2024-02-29 · accept · none · ref 226 · internal anchor
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
Coverage-Driven KV Cache Eviction for Efficient and Improved Inference of LLM cs.CL · 2026-06-28 · unverdicted · none · ref 75 · internal anchor
K-VEC is a coverage-aware KV-cache eviction strategy using cross-head and cross-layer modules that improves performance by up to 10.35 points over prior methods on LongBench subsets at fixed memory budget.
I-WebGenBench : Evaluating Interactivity in LLM-Generated Scientific Web Applications cs.CL · 2026-05-30 · unverdicted · none · ref 25 · internal anchor
A Paper-to-Interactive-System Agent and I-WebGenBench benchmark with 19 papers enable converting scientific PDFs into executable interactive web systems, with PaperVoyager framework shown to improve quality.
Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning cs.AI · 2026-05-14 · unverdicted · none · ref 25 · internal anchor
LaMR decomposes code context pruning into two rubrics using dedicated CRFs, a mixture-of-experts gate, and AST-derived labels to filter noise and often match or beat full-context baselines on coding benchmarks.
Can LLMs be Effective Code Contributors? A Study on Open-source Projects cs.SE · 2026-04-25 · unverdicted · none · ref 12 · internal anchor
LLMs achieve only 0-60% success when asked to contribute code to sizable open-source projects, often failing basic checks or simply repeating training data.
Specification-Driven Generation and Evaluation of Discrete-Event World Models via the DEVS Formalism cs.AI · 2026-03-04 · unverdicted · none · ref 25 · internal anchor
A staged LLM pipeline synthesizes verifiable discrete-event world models from natural language specifications using the DEVS formalism for long-horizon consistency in LLM agents.
A Survey on Large Language Models for Code Generation cs.CL · 2024-06-01 · unverdicted · none · ref 171 · internal anchor
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding cs.DC · 2026-02-10 · unreviewed · ref 28 · internal anchor

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer