DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

Adarsh Kumarappan; Elsie Nallipogu; Gabriel Ryan; Pareesa Ameneh Golnari; Shengyu Fu; Wen Wen; Xiaoyu Liu; Yuting Sun

arxiv: 2601.11895 · v3 · pith:VXUT3H5Anew · submitted 2026-01-17 · 💻 cs.LG · cs.AI · cs.SE

Adarsh Kumarappan, Pareesa Ameneh Golnari, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu

keywords: benchmark, models, benchmarks, code, devbench, evaluation, model, practical

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real developer telemetry and synthesized using generator models from multiple provider families to mitigate single-source bias. Unlike prior benchmarks, it emphasizes ecological validity, avoids training data contamination, and enables detailed diagnostics. The evaluation combines functional correctness, similarity-based metrics, and LLM-judge assessments focused on usefulness and contextual relevance. Nine state-of-the-art models were assessed, with the strongest achieving only 43.5% Pass@1, confirming that the benchmark remains challenging and revealing differences in syntactic precision, semantic reasoning, and practical utility. Our benchmark provides actionable insights to guide model selection and improvement, a level of detail that is often missing from other benchmarks but is essential for both practical deployment and targeted model development.
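
For readers unfamiliar with the Pass@1 figure quoted in the abstract, the sketch below shows the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), which reduces to c/n when k = 1. This is an illustrative assumption about how such a score is commonly computed, not a description of DevBench's own evaluation harness; the function names and the (n, c) input layout are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of completions sampled for the task
    c: number of those completions that passed the functional tests
    k: budget of attempts being scored (k = 1 gives Pass@1)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over tasks, given hypothetical (n, c) pairs per task."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

With a single sample per task (n = 1), the benchmark-level Pass@1 is simply the fraction of tasks whose one completion passes its tests.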

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents
cs.SE 2026-05 unverdicted novelty 7.0

SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.

SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
cs.SE 2026-05 unverdicted novelty 6.0

SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.