VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
hub Canonical reference
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Large Language Models (LLMs) have greatly advanced code auto-completion systems, with a potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench supports both Python and Java and consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and encouraging continuous improvement in auto-completion systems. RepoBench is publicly available at https://github.com/Leolty/repobench.
hub tools
citation-role summary
citation-polarity summary
polarities
background 9representative citing papers
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
RepoGenesis benchmark shows top AI systems reach only 23.67% Pass@1 on full microservice repository generation despite up to 73.91% API coverage and 100% deployment success.
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
BIRDS framework quantifies request-level biodiversity impacts of LLM serving via operational and embodied pathways and introduces QNBI to jointly assess impact and quality, showing accumulation at scale across workloads, models, GPUs, and regions.
Stale repository context in code RAG actively induces models to produce obsolete helper references, raising stale outputs by 76-88 percentage points over current-only retrieval in a 17-sample diagnostic study.
KV offloading degrades accuracy on context-intensive tasks due to low-rank key projections and unreliable landmarks; a simpler alternative improves results across models and benchmarks.
EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative alignment.
ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.
LLMs predict story points better in zero-shot prompting than supervised deep learning models trained on 80% of project data, with few-shot examples and comparative judgments further improving performance.
PerfCoder is a family of LLMs trained on optimization trajectories with human annotations and runtime-based preference alignment that achieves higher runtime speedups and optimization rates on the PIE benchmark than prior models while producing interpretable feedback.
LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.
Multi-SWE-bench provides 1,632 high-quality issue-resolving instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++ for evaluating LLMs on codebase modifications.
A q-log odds variant of BM25 raises NDCG@10 by 89% relative on CodeSearchNet Go under fixed generic tokenization while recovering standard BM25 at q=1.
Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.
VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.
DualKV eliminates redundant prompt replication in RL training attention kernels via fused dual-KV CUDA operations and token repacking, delivering 1.63-3.82x policy-update speedups while remaining mathematically equivalent to standard attention.
Introduces a three-fold benchmark for LLM-generated code combining correctness testing on a complex project, quality verification, and developer surveys to assess production readiness.
LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting for runtime conditions.
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
AsyncTLS delivers full-attention accuracy with 1.2-10x operator speedups and 1.3-4.7x end-to-end throughput gains on 48k-96k contexts via two-level sparse attention and asynchronous offloading.
CompilerKV uses offline-compiled retention tables as portable priors to achieve SOTA prefill-only KV compression performance across backbones at low token budgets.
citing papers explorer
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
-
InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
-
RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository
RepoGenesis benchmark shows top AI systems reach only 23.67% Pass@1 on full microservice repository generation despite up to 73.91% API coverage and 100% deployment success.
-
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
-
BIRDS: Characterizing and Understanding Biodiversity Impact of Large Language Model Serving
BIRDS framework quantifies request-level biodiversity impacts of LLM serving via operational and embodied pathways and introduces QNBI to jointly assess impact and quality, showing accumulation at scale across workloads, models, GPUs, and regions.
-
When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context
Stale repository context in code RAG actively induces models to produce obsolete helper references, raising stale outputs by 76-88 percentage points over current-only retrieval in a 17-sample diagnostic study.
-
KV Cache Offloading for Context-Intensive Tasks
KV offloading degrades accuracy on context-intensive tasks due to low-rank key projections and unreliable landmarks; a simpler alternative improves results across models and benchmarks.
-
Toward Executable Repository-Level Code Generation via Environment Alignment
EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative alignment.
-
ABTest: Behavior-Driven Testing for AI Coding Agents
ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.
-
Story Point Estimation Using Large Language Models
LLMs predict story points better in zero-shot prompting than supervised deep learning models trained on 80% of project data, with few-shot examples and comparative judgments further improving performance.
-
PerfCoder: Large Language Models for Interpretable Code Performance Optimization
PerfCoder is a family of LLMs trained on optimization trajectories with human annotations and runtime-based preference alignment that achieves higher runtime speedups and optimization rates on the PIE benchmark than prior models while producing interpretable feedback.
-
Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software
LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.
-
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Multi-SWE-bench provides 1,632 high-quality issue-resolving instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++ for evaluating LLMs on codebase modifications.
-
Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix
A q-log odds variant of BM25 raises NDCG@10 by 89% relative on CodeSearchNet Go under fixed generic tokenization while recovering standard BM25 at q=1.
-
Contextualized Code Pretraining for Code Generation
Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.
-
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.
-
DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts
DualKV eliminates redundant prompt replication in RL training attention kernels via fused dual-KV CUDA operations and token repacking, delivering 1.63-3.82x policy-update speedups while remaining mathematically equivalent to standard attention.
-
Evaluating LLM-Generated Code: A Benchmark and Developer Study
Introduces a three-fold benchmark for LLM-generated code combining correctness testing on a complex project, quality verification, and developer surveys to assess production readiness.
-
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
-
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
-
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting for runtime conditions.
-
When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
-
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
AsyncTLS delivers full-attention accuracy with 1.2-10x operator speedups and 1.3-4.7x end-to-end throughput gains on 48k-96k contexts via two-level sparse attention and asynchronous offloading.
-
CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation
CompilerKV uses offline-compiled retention tables as portable priors to achieve SOTA prefill-only KV compression performance across backbones at low token budgets.
-
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
LongBench v2 benchmark shows current LLMs underperform humans on deep long-context reasoning tasks, but extended inference-time reasoning enables surpassing the human baseline.
-
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on long-context benchmarks.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
StarCoder 2 and The Stack v2: The Next Generation
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
-
Coverage-Driven KV Cache Eviction for Efficient and Improved Inference of LLM
K-VEC is a coverage-aware KV-cache eviction strategy using cross-head and cross-layer modules that improves performance by up to 10.35 points over prior methods on LongBench subsets at fixed memory budget.
-
I-WebGenBench : Evaluating Interactivity in LLM-Generated Scientific Web Applications
A Paper-to-Interactive-System Agent and I-WebGenBench benchmark with 19 papers enable converting scientific PDFs into executable interactive web systems, with PaperVoyager framework shown to improve quality.
-
Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning
LaMR decomposes code context pruning into two rubrics using dedicated CRFs, a mixture-of-experts gate, and AST-derived labels to filter noise and often match or beat full-context baselines on coding benchmarks.
-
Can LLMs be Effective Code Contributors? A Study on Open-source Projects
LLMs achieve only 0-60% success when asked to contribute code to sizable open-source projects, often failing basic checks or simply repeating training data.
-
Specification-Driven Generation and Evaluation of Discrete-Event World Models via the DEVS Formalism
A staged LLM pipeline synthesizes verifiable discrete-event world models from natural language specifications using the DEVS formalism for long-horizon consistency in LLM agents.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
- SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding