Pith: machine review for the scientific record

arXiv: 2512.18470 · v5 · submitted 2025-12-20 · 💻 cs.SE · cs.AI · cs.MA

Recognition: unknown

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

Authors on Pith: no claims yet
classification: 💻 cs.SE · cs.AI · cs.MA
keywords: long-horizon · swe-evo · agents · software · tasks · coding · evolution · files

Original abstract

Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving functionality. We introduce SWE-EVO, a benchmark for this long-horizon software evolution challenge. Constructed from release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks requiring multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO versus 72.80% achieved by GPT-5.2 on SWE-Bench Verified, showing that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a metric capturing partial progress on these complex, long-horizon tasks.
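The abstract introduces Fix Rate as a partial-progress metric but does not spell out its formula. A minimal sketch of one plausible reading, assuming Fix Rate means the fraction of an instance's initially failing tests that pass after the agent's changes (the function name and definition here are illustrative assumptions, not the paper's stated method):

```python
def fix_rate(baseline_failures: set[str], post_agent_failures: set[str]) -> float:
    """Hypothetical partial-progress metric for one benchmark instance.

    baseline_failures: test IDs failing before the agent runs.
    post_agent_failures: test IDs still failing after the agent's edits.
    Returns the fraction of initially failing tests the agent fixed.
    NOTE: the paper's exact Fix Rate definition is not given in the
    abstract; this fixed-failure fraction is an assumed formulation.
    """
    if not baseline_failures:
        return 1.0  # nothing to fix: treat as fully solved
    fixed = baseline_failures - post_agent_failures
    return len(fixed) / len(baseline_failures)


# Example: 4 tests failed initially; the agent repaired 3 of them.
print(fix_rate({"t1", "t2", "t3", "t4"}, {"t4"}))  # → 0.75
```

Averaging such a per-instance score over the 48 tasks would yield a benchmark-level number that credits multi-file progress even when no task is fully resolved.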

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  2. SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

    cs.SE 2026-05 unverdicted novelty 7.0

    SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.

  3. Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution

    cs.SE 2026-05 unverdicted novelty 7.0

TEBench is a new project-level benchmark for test evolution, showing that coding agents achieve only 45-49% F1 when identifying tests that need changes, with stale tests the hardest category because agents rely on execution failures.

  4. ProgramBench: Can Language Models Rebuild Programs From Scratch?

    cs.SE 2026-05 unverdicted novelty 7.0

    ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...

  5. Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems

    cs.AI 2026-04 unverdicted novelty 7.0

A harness for AI agents enabled the construction, in three months, of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems.

  6. The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents

    cs.SE 2026-05 unverdicted novelty 5.0

    Triadic data—synchronized human-human conversations, human-AI sessions, and cross-functional team work—is the essential substrate for training long-horizon software engineering agents.

  7. More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems

    cs.SE 2026-04 unverdicted novelty 5.0

    AI-native software ecosystems exhibit emergent behaviors best explained by complex adaptive systems theory, requiring new ecosystem-level monitoring and seven testable propositions that may extend or replace Lehman's laws.