SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Alexander Wettig, Carlos E. Jimenez, John Yang, Kexin Pei, Ofir Press, Shunyu Yao · 2023 · cs.CL · arXiv 2310.06770

Canonical reference. 82% of citing Pith papers cite this work as background.

288 Pith papers citing it

Background 82% of classified citations

abstract

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.

citation-role summary

background: 56 · dataset: 10 · baseline: 3 · method: 3

citation-polarity summary

background: 59 · use dataset: 6 · baseline: 3 · use method: 3 · support: 1

authors

Alexander Wettig, Carlos E. Jimenez, John Yang, Kexin Pei, Ofir Press, Shunyu Yao

representative citing papers

Heimdall: Formally Verified Automated Migration of Legacy eBPF Programs to Rust

cs.CR · 2026-05-25 · unverdicted · novelty 8.0

Heimdall automates translation of eBPF C programs to Rust with formal equivalence proofs for 94.1% of 102 tested programs using LLMs, static analysis, and Z3-based checking.

ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents

cs.CR · 2026-05-13 · conditional · novelty 8.0

ExploitBench decomposes LLM exploitation into 16 oracle-verified capability flags and finds public frontier models trigger crashes but rarely reach arbitrary code execution on 41 V8 bugs.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

cs.AI · 2026-05-10 · unverdicted · novelty 8.0

PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

StabilizerBench: A Benchmark for AI-Assisted Quantum Error Correction Circuit Synthesis

quant-ph · 2026-04-23 · conditional · novelty 8.0

StabilizerBench is a new benchmark for evaluating AI agents on generating, optimizing, and making fault-tolerant stabilizer circuits for quantum error correction, with efficient verification and multi-tier scoring.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

cs.AI · 2026-04-16 · unverdicted · novelty 8.0

HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.

SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

cs.SE · 2026-03-25 · unverdicted · novelty 8.0

SlopCodeBench shows coding agents degrade in structural quality and verbosity across iterative extensions, with no agent solving any problem completely and agent code 2x more eroded than human code.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

cs.CR · 2025-07-14 · unverdicted · novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

cs.AI · 2024-08-12 · unverdicted · novelty 8.0

The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs

cs.AI · 2026-06-16 · unverdicted · novelty 7.0

LLMs brew code answers in early layers before resolving into Resolved, Overprocessed, Misresolved, or Unresolved states, with 41.5% resolved overall and brewing duration stable at 24-42% across 16 models.

LLM Agents Can See Code Repositories

cs.SE · 2026-06-12 · unverdicted · novelty 7.0

Visual graphs of repository structure added to text inputs for multimodal LLM agents reduce token consumption by up to 26% while maintaining or improving issue-resolution accuracy.

From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents

cs.SE · 2026-06-03 · conditional · novelty 7.0

A new six-dimension process taxonomy for AI software development frameworks shows convergence on artifact persistence and human oversight but reveals that no framework covers all dimensions strongly, indicating a depth-portability trade-off.

Sakura: An Approach for Generating Complex Tests from Natural Language Test Descriptions

cs.SE · 2026-05-30 · unverdicted · novelty 7.0

Sakura is a multi-agent system that generates structurally complex tests from NL descriptions, achieving 50-78% higher compilability and 38-66% higher coverage overlap than baselines on 1,464 scenarios from 20 Apache Commons applications.

Specialize Roles, Mix Deployments: Pushing the Cost-Accuracy Frontier of LLM Agent Teams

cs.MA · 2026-05-28 · unverdicted · novelty 7.0

AgentCARD benchmark shows heterogeneous LLM agent teams with mixed deployments reach the cost-accuracy frontier, delivering up to 44% higher accuracy or 12x lower cost than uniform teams, with domain-specific role bottlenecks.

EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution

cs.SE · 2026-05-28 · unverdicted · novelty 7.0

EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.

On the Road to Personalized Code Intelligence: Portraiting and Assisting Developers Based on Their In-IDE Behaviors

cs.SE · 2026-05-28 · unverdicted · novelty 7.0

VirtualME is a new infrastructure that continuously extracts and interprets in-IDE developer behaviors to build personalized personas, delivering 33.8% better performance on repository-level knowledge Q&A than generic baselines.

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.
