Copilot Arena: A Platform for Code LLM Evaluation in the Wild. arXiv preprint arXiv:2502.09328
3 Pith papers cite this work.
citing papers explorer
- Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks
The two main benchmarks for LLM instructed code editing over-represent Python, miss common real-world domains and edit types, and have test coverage issues that limit what they measure.
- RECAP: An End-to-End Platform for Capturing, Replaying, and Analyzing AI-Assisted Programming Interactions
RECAP captures, replays, and analyzes AI-assisted programming sessions by linking prompts, edits, and developer actions in a single timeline.
- Mercury: Ultra-Fast Language Models Based on Diffusion
Mercury Coder diffusion LLMs achieve throughputs of 1109 and 737 tokens per second on H100 GPUs, up to 10x faster than frontier models with comparable quality.