super hub Canonical reference

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Alexander Wettig, Carlos E. Jimenez, John Yang, Kexin Pei, Ofir Press, Shunyu Yao · 2023 · cs.CL · arXiv 2310.06770

Canonical reference. 82% of citing Pith papers cite this work as background.

366 Pith papers citing it

Background 82% of classified citations

open full Pith review browse 366 citing papers more from Alexander Wettig arXiv PDF

abstract

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of $2,294$ software engineering problems drawn from real GitHub issues and corresponding pull requests across $12$ popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere $1.96$% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 56 dataset 10 baseline 3 method 3

citation-polarity summary

background 59 use dataset 6 baseline 3 use method 3 support 1

claims ledger

abstract Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of $2,294$ software engineering problems drawn from real GitHub issues and corresponding pull requests across $12$ popular Python repositories. Given a codebase along with a description of an issue to be resolved, a

authors

Alexander Wettig Carlos E. Jimenez John Yang Kexin Pei Ofir Press Shunyu Yao

co-cited works

representative citing papers

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

cs.AI · 2026-06-04 · unverdicted · novelty 8.0

CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing limited gains and that naive in-context learning outperforms dedicated memory systems.

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

cs.AI · 2026-06-03 · unverdicted · novelty 8.0

The Meta-Agent Challenge shows frontier AI models rarely match human-engineered agent baselines when tasked with autonomous development, with proprietary models succeeding most often and some exhibiting cheating under pressure.

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete

cs.LG · 2026-06-01 · unverdicted · novelty 8.0

Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.

Heimdall: Formally Verified Automated Migration of Legacy eBPF Programs to Rust

cs.CR · 2026-05-25 · unverdicted · novelty 8.0

Heimdall automates translation of eBPF C programs to Rust with formal equivalence proofs for 94.1% of 102 tested programs using LLMs, static analysis, and Z3-based checking.

ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents

cs.CR · 2026-05-13 · conditional · novelty 8.0

ExploitBench decomposes LLM exploitation into 16 oracle-verified capability flags and finds public frontier models trigger crashes but rarely reach arbitrary code execution on 41 V8 bugs.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

cs.AI · 2026-05-10 · unverdicted · novelty 8.0

PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

StabilizerBench: A Benchmark for AI-Assisted Quantum Error Correction Circuit Synthesis

quant-ph · 2026-04-23 · conditional · novelty 8.0

StabilizerBench is a new benchmark for evaluating AI agents on generating, optimizing, and making fault-tolerant stabilizer circuits for quantum error correction, with efficient verification and multi-tier scoring.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

cs.AI · 2026-04-16 · unverdicted · novelty 8.0

HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.

SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

cs.SE · 2026-03-25 · unverdicted · novelty 8.0

SlopCodeBench shows coding agents degrade in structural quality and verbosity across iterative extensions, with no agent solving any problem completely and agent code 2x more eroded than human code.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

cs.CR · 2025-07-14 · unverdicted · novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

cs.AI · 2024-08-12 · unverdicted · novelty 8.0

The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

A$^{2}$utoLPBench: An Auto-Generated, Agent-Friendly LP Benchmark via Inverse-KKT Construction

cs.AI · 2026-07-02 · conditional · novelty 7.0

A²utoLPBench is a generator that produces unlimited LP word problems with ground-truth answers known by construction via inverse-KKT, bundled with a Docker environment for agent evaluation.

SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

cs.SE · 2026-06-29 · unverdicted · novelty 7.0

SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.

CHIA: An open-source framework for principled, agentic AI-driven hardware/software co-design research

cs.AR · 2026-06-25 · unverdicted · novelty 7.0

CHIA introduces a framework for building and deploying agentic AI co-design flows as CHIA loops with tool nodes, reliability mechanisms, and five case-study demonstrations.

Glite ARF: Verifier-Driven Research with Parallel LLM Coding Agents

cs.MA · 2026-06-25 · accept · novelty 7.0

Glite ARF introduces a verifier-driven three-role framework for parallel LLM coding agents, demonstrated by first- and second-place finishes in the BEA 2026 vocabulary-difficulty shared task across three languages with 29.9-35.9% RMSE reduction at ~$450 API cost.

RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents

cs.SE · 2026-06-21 · unverdicted · novelty 7.0

RigorBench evaluates AI coding agents on process discipline via five pillars and reports 41% higher process scores and 17% better outcome correctness with structured approaches on 30 tasks.

citing papers explorer

Showing 16 of 16 citing papers after filters.

Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries cs.SE · 2026-05-09 · conditional · none · ref 14 · internal anchor
SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.
Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding cs.SE · 2026-04-05 · accept · none · ref 30 · internal anchor
SADU benchmark shows top VLMs reach only 70% accuracy on software architecture diagram tasks, revealing gaps in visual reasoning for engineering artifacts.
Toward Executable Repository-Level Code Generation via Environment Alignment cs.SE · 2026-04-04 · unverdicted · none · ref 12 · internal anchor
EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative alignment.
AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits cs.SE · 2026-04-03 · conditional · none · ref 17 · internal anchor
AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation cs.SE · 2023-05-02 · accept · none · ref 28 · internal anchor
EvalPlus augments HumanEval with 80x more tests via LLM and mutation strategies, exposing up to 28.9% more incorrect LLM-generated code and reversing some model performance rankings.
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle cs.SE · 2026-05-13 · unverdicted · none · ref 19 · internal anchor
SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation cs.SE · 2026-05-08 · unverdicted · none · ref 22 · internal anchor
VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.
Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture cs.SE · 2026-05-02 · unverdicted · none · ref 13 · internal anchor
RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full hard-negative suppression on a 200-case benchmark.
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows cs.SE · 2026-04-30 · unverdicted · none · ref 15 · internal anchor
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
You Don't Need Public Tests to Generate Correct Code cs.SE · 2026-04-23 · unverdicted · none · ref 11 · internal anchor
DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or external signals.
SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents cs.SE · 2026-04-20 · unverdicted · none · ref 45 · internal anchor
SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.
When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation cs.SE · 2026-04-10 · unverdicted · none · ref 19 · internal anchor
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics cs.SE · 2026-04-06 · unverdicted · none · ref 12 · internal anchor
SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints cs.SE · 2026-04-06 · unverdicted · none · ref 20 · internal anchor
Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.
A meta-analysis of the effect of generative AI on productivity and learning in programming cs.SE · 2026-05-06 · unverdicted · none · ref 53 · internal anchor
Meta-analysis of 23 studies shows moderate productivity gains from GenAI coding assistants (Hedges' g=0.33) but no significant effect on learning (g=0.14).
Towards Enabling An Artificial Self-Construction Software Life-cycle via Autopoietic Architectures cs.SE · 2026-04-15 · unverdicted · none · ref 22 · internal anchor
Proposes autopoietic architectures for self-constructing software as a fundamental shift in the SDLC, leveraging foundation models for autonomous evolution and maintenance.

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer