pith. sign in

arxiv: 2601.11895 · v3 · pith:VXUT3H5Anew · submitted 2026-01-17 · 💻 cs.LG · cs.AI· cs.SE

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

classification 💻 cs.LG cs.AIcs.SE
keywords benchmarkmodelsbenchmarkscodedevbenchevaluationmodelpractical
0
0 comments X
read the original abstract

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real developer telemetry and synthesized using generator models from multiple provider families to mitigate single-source bias. Unlike prior benchmarks, it emphasizes ecological validity, avoids training data contamination, and enables detailed diagnostics. The evaluation combines functional correctness, similarity-based metrics, and LLM-judge assessments focused on usefulness and contextual relevance. 9 state-of-the-art models were assessed, with the strongest achieving only 43.5% Pass@1, confirming the benchmark remains challenging and revealing differences in syntactic precision, semantic reasoning, and practical utility. Our benchmark provides actionable insights to guide model selection and improvement, detail that is often missing from other benchmarks but is essential for both practical deployment and targeted model development.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

    cs.SE 2026-05 unverdicted novelty 7.0

    SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.

  2. SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

    cs.SE 2026-05 unverdicted novelty 6.0

    SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.