SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Canonical reference. 82% of citing Pith papers cite this work as background.

319 Pith papers citing it
Background 82% of classified citations
abstract

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
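
As a rough sanity check on the abstract's headline figure, the sketch below converts the reported 1.96% resolve rate into an approximate number of issues. It assumes the rate is measured over all 2,294 tasks, which the abstract does not state explicitly; the variable names are illustrative only.

```python
# Figures quoted from the abstract above.
TOTAL_TASKS = 2294          # SWE-bench problems
BEST_RESOLVE_RATE = 1.96    # percent, reported for Claude 2

# Assumption: the rate uses all 2,294 tasks as its denominator.
approx_resolved = round(TOTAL_TASKS * BEST_RESOLVE_RATE / 100)

# Round-trip check: the implied count should reproduce the reported rate.
implied_rate = round(approx_resolved / TOTAL_TASKS * 100, 2)

print(approx_resolved, implied_rate)
```

Under that assumption, the reported rate corresponds to roughly 45 of the 2,294 issues.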

citation-role summary

background 56 · dataset 10 · baseline 3 · method 3

representative citing papers

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

Glite ARF: Verifier-Driven Research with Parallel LLM Coding Agents

cs.MA · 2026-06-25 · accept · novelty 7.0

Glite ARF introduces a verifier-driven three-role framework for parallel LLM coding agents, demonstrated by first- and second-place finishes in the BEA 2026 vocabulary-difficulty shared task across three languages with 29.9-35.9% RMSE reduction at ~$450 API cost.

LLM Agents Can See Code Repositories

cs.SE · 2026-06-12 · unverdicted · novelty 7.0

Visual graphs of repository structure added to text inputs for multimodal LLM agents reduce token consumption by up to 26% while maintaining or improving issue-resolution accuracy.

citing papers explorer

Showing 17 of 17 citing papers after filters.

  • Heimdall: Formally Verified Automated Migration of Legacy eBPF Programs to Rust
    cs.CR · 2026-05-25 · unverdicted · none · ref 34 · internal anchor

    Heimdall automates translation of eBPF C programs to Rust with formal equivalence proofs for 94.1% of 102 tested programs using LLMs, static analysis, and Z3-based checking.

  • ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents
    cs.CR · 2026-05-13 · conditional · none · ref 3 · internal anchor

    ExploitBench decomposes LLM exploitation into 16 oracle-verified capability flags and finds public frontier models trigger crashes but rarely reach arbitrary code execution on 41 V8 bugs.

  • ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
    cs.CR · 2025-07-14 · unverdicted · none · ref 20 · internal anchor

    ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

  • Do Coding Agents Understand Least-Privilege Authorization?
    cs.CR · 2026-05-14 · unverdicted · none · ref 12 · internal anchor

    Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.

  • SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
    cs.CR · 2026-05-12 · unverdicted · none · ref 62 · internal anchor

    SkillSafetyBench is a benchmark of 155 cases across 47 tasks and 6 risk domains showing that non-user attacks via skills, artifacts, or environments can consistently induce unsafe agent behavior.

  • CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
    cs.CR · 2026-05-08 · unverdicted · none · ref 8 · internal anchor

    LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.

  • MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
    cs.CR · 2026-05-05 · unverdicted · none · ref 3 · internal anchor

    MOSAIC-Bench demonstrates that nine production coding agents achieve 53-86% end-to-end attack success rates on staged innocuous tickets across 10 web substrates and 31 CWE classes, far higher than the 0-20.4% rates seen with direct prompts.

  • Feedback-Driven Execution for LLM-Based Binary Analysis
    cs.CR · 2026-04-16 · unverdicted · none · ref 20 · internal anchor

    FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precision and broader coverage than prior methods.

  • RealVuln: Benchmarking Rule-Based, General-Purpose LLM, and Security-Specialized Scanners on Real-World Code
    cs.CR · 2026-04-15 · unverdicted · none · ref 15 · internal anchor

    RealVuln benchmark finds security-specialized scanners outperform general-purpose LLMs and rule-based SAST tools on hand-labeled vulnerable Python code under F3 scoring, with all artifacts released.

  • Formal Policy Enforcement for Real-World Agentic Systems
    cs.CR · 2026-02-18 · unverdicted · none · ref 37 · internal anchor

    FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.

  • LLM Agents can Autonomously Exploit One-day Vulnerabilities
    cs.CR · 2024-04-11 · unverdicted · none · ref 9 · internal anchor

    GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.

  • unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning
    cs.CR · 2026-05-27 · unverdicted · none · ref 5 · internal anchor

    unix-ctf procedurally generates 656 Unix CTF tasks across 155 techniques; fine-tuning Qwen3-8B on them raises solve rate from 11.6% to 43.6% on a 15-skill holdout and yields +33 pp in Forensics on InterCode-CTF.

  • Root-Cause-Driven Automated Vulnerability Repair
    cs.CR · 2026-05-05 · unverdicted · none · ref 27 · internal anchor

    Kumushi improves automated vulnerability repair by focusing LLM edits on root causes via dynamic localization and ranking, yielding more genuine fixes than prior agents on 178 C/C++ vulnerabilities.

  • From Context to Rules: Toward Unified Detection Rule Generation
    cs.CR · 2026-04-13 · unverdicted · none · ref 6 · internal anchor

    UniRule formalizes detection rule generation as a unified mapping from contexts and languages to rules and uses dual semantic projections in an agentic RAG setup to outperform direct LLM generation across 12 scenarios with a Bradley-Terry score of 0.52.

  • SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
    cs.CR · 2026-02-24 · unverdicted · none · ref 63 · internal anchor

    The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

  • CoT-Guard: Small Models for Strong Monitoring
    cs.CR · 2026-05-12 · unverdicted · none · ref 4 · internal anchor

    CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.

  • AgentCrypt: Advancing Privacy and (Secure) Computation in AI Agent Collaboration
    cs.CR · 2025-12-08 · unverdicted · none · ref 20 · internal anchor

    AgentCrypt introduces a deterministic three-tier privacy framework for AI agent collaboration that uses masking and homomorphic encryption to protect data independently of model accuracy.