SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

Amir Yazdanbakhsh; Enhui Li; Jeffrey Jian Ma; Kevin Swersky; Milad Hashemi; Ofir Press; Parthasarathy Ranganathan; Vijay Janapa Reddi

arxiv: 2511.06090 · v3 · pith:JDREUXQCnew · submitted 2025-11-08 · 💻 cs.SE · cs.AI · cs.PF

Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, Parthasarathy Ranganathan

classification 💻 cs.SE · cs.AI · cs.PF

keywords agents, code, expert, performance, reasoning, repositories, software, speedup

Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather than how to fix code. We introduce SWE-fficiency, a benchmark for evaluating repository-level performance optimization on real workloads. Our suite contains 498 tasks across nine widely used data-science, machine-learning, and HPC repositories (e.g., numpy, pandas, scipy): given a complete codebase and a slow workload, an agent must investigate code semantics, localize bottlenecks and relevant tests, and produce a patch that matches or exceeds expert speedup while passing the same unit tests. To enable this how-to-fix evaluation, our automated pipeline scrapes GitHub pull requests for performance-improving edits, combining keyword filtering, static analysis, coverage tooling, and execution validation to both confirm expert speedup baselines and identify relevant repository unit tests. Empirical evaluation of state-of-the-art agents reveals significant underperformance. On average, agents achieve less than 0.23x the expert speedup: agents struggle in localizing optimization opportunities, reasoning about execution across functions, and maintaining correctness in proposed edits. We release the benchmark and accompanying data pipeline to facilitate research on automated performance engineering and long-horizon software reasoning.
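The comparison against expert speedup can be illustrated with a small sketch. It assumes that speedup means baseline runtime divided by patched runtime, and that an agent is scored by the ratio of its speedup to the expert's on the same workload. The function names, the timings, and the per-task formula are hypothetical; the abstract does not state how the benchmark aggregates scores across tasks.

```python
def speedup(baseline_seconds: float, patched_seconds: float) -> float:
    """Runtime of the original code divided by runtime after the patch."""
    if baseline_seconds <= 0 or patched_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / patched_seconds


def speedup_ratio(baseline_seconds: float,
                  expert_seconds: float,
                  agent_seconds: float) -> float:
    """Agent speedup as a fraction of expert speedup (assumed metric)."""
    return (speedup(baseline_seconds, agent_seconds)
            / speedup(baseline_seconds, expert_seconds))


def matches_expert(ratio: float, tests_pass: bool) -> bool:
    """A patch counts only if the unit tests pass and the ratio reaches 1.0."""
    return tests_pass and ratio >= 1.0


# Hypothetical timings: the workload takes 10 s before any patch,
# 2 s with the expert patch, and 8 s with the agent patch.
ratio = speedup_ratio(10.0, 2.0, 8.0)
print(ratio)                         # 0.25
print(matches_expert(ratio, True))   # False
```

Under these assumptions, a ratio of 0.25 means the agent recovered a quarter of the expert's speedup, and a patch that fails the unit tests is rejected however fast it runs.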

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
cs.AI 2026-05 unverdicted novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL
cs.LG 2026-05 conditional novelty 7.0

Extrapolative weight averaging of RL checkpoints trained under nested unit-test coverage extends a correctness-efficiency frontier and boosts ensemble pass rates in code generation across model scales and inference modes.

CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ Commits
cs.SE 2026-05 accept novelty 7.0

CppPerf-Mine produces CppPerf-DB, a benchmark of 347 real-world performance-improving C++ patches (39% multi-file) from 42 repositories to evaluate repository-level repair tools.

JETO-Bench: A Reproducible Benchmark for Execution Time Improvement Patches in Java
cs.SE 2026-06 conditional novelty 6.0

JETO-Mine is a reusable three-phase pipeline that mines 1.8 million Java commits to produce JETO-Bench containing 91 verified executable ETIPs, on which OpenHands succeeds at 14.3%.