SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Alexander Wettig, Carlos E. Jimenez, John Yang, Kexin Pei, Ofir Press, Shunyu Yao · 2023 · cs.CL · arXiv 2310.06770

Canonical reference. 82% of citing Pith papers cite this work as background.

382 Pith papers citing it

Background: 82% of classified citations

abstract

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of $2,294$ software engineering problems drawn from real GitHub issues and corresponding pull requests across $12$ popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere $1.96$% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
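
To put the headline figure in absolute terms, here is a quick calculation. It is only a sketch: it assumes the 1.96% is measured over all 2,294 task instances, which the abstract does not spell out, and the variable names are illustrative.

```python
# Figures taken from the abstract above.
TOTAL_INSTANCES = 2294      # SWE-bench task instances
BEST_RESOLVE_RATE = 1.96    # percent resolved by the best model (Claude 2)

# Assumption: the percentage is computed over the full benchmark.
resolved = round(TOTAL_INSTANCES * BEST_RESOLVE_RATE / 100)
print(resolved)  # prints 45
```

Under that assumption, the best model resolved roughly 45 issues out of 2,294.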

citation-role summary

background: 56 · dataset: 10 · baseline: 3 · method: 3

citation-polarity summary

background: 59 · use dataset: 6 · baseline: 3 · use method: 3 · support: 1
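
The 82% background share quoted at the top of the page appears to follow from the polarity counts rather than the role counts. The sketch below checks that arithmetic. It assumes the share is simply the background count divided by the total of classified citations; the page does not state its formula, and the helper name `share` is illustrative.

```python
# Counts copied from the two summaries on this page.
role_counts = {"background": 56, "dataset": 10, "baseline": 3, "method": 3}
polarity_counts = {
    "background": 59,
    "use dataset": 6,
    "baseline": 3,
    "use method": 3,
    "support": 1,
}

def share(counts, key):
    """Return the percentage of classified citations that fall under `key`."""
    total = sum(counts.values())
    return 100.0 * counts[key] / total

role_share = share(role_counts, "background")
polarity_share = share(polarity_counts, "background")
print(f"role: {role_share:.1f}%")          # prints role: 77.8%
print(f"polarity: {polarity_share:.1f}%")  # prints polarity: 81.9%
```

Both summaries total 72 classified citations. Only the polarity split rounds to the quoted 82%.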

representative citing papers

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

cs.AI · 2026-06-04 · unverdicted · novelty 8.0

CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing limited gains and that naive in-context learning outperforms dedicated memory systems.

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

cs.AI · 2026-06-03 · unverdicted · novelty 8.0

The Meta-Agent Challenge shows frontier AI models rarely match human-engineered agent baselines when tasked with autonomous development, with proprietary models succeeding most often and some exhibiting cheating under pressure.

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete

cs.LG · 2026-06-01 · unverdicted · novelty 8.0

Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.

Heimdall: Formally Verified Automated Migration of Legacy eBPF Programs to Rust

cs.CR · 2026-05-25 · unverdicted · novelty 8.0

Heimdall automates translation of eBPF C programs to Rust with formal equivalence proofs for 94.1% of 102 tested programs using LLMs, static analysis, and Z3-based checking.

ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents

cs.CR · 2026-05-13 · conditional · novelty 8.0

ExploitBench decomposes LLM exploitation into 16 oracle-verified capability flags and finds public frontier models trigger crashes but rarely reach arbitrary code execution on 41 V8 bugs.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

cs.AI · 2026-05-10 · unverdicted · novelty 8.0

PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

StabilizerBench: A Benchmark for AI-Assisted Quantum Error Correction Circuit Synthesis

quant-ph · 2026-04-23 · conditional · novelty 8.0

StabilizerBench is a new benchmark for evaluating AI agents on generating, optimizing, and making fault-tolerant stabilizer circuits for quantum error correction, with efficient verification and multi-tier scoring.

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

cs.AI · 2026-04-16 · unverdicted · novelty 8.0

HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.

SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

cs.SE · 2026-03-25 · unverdicted · novelty 8.0

SlopCodeBench shows coding agents degrade in structural quality and verbosity across iterative extensions, with no agent solving any problem completely and agent code 2x more eroded than human code.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

cs.CR · 2025-07-14 · unverdicted · novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

cs.AI · 2024-08-12 · unverdicted · novelty 8.0

The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

A²utoLPBench: An Auto-Generated, Agent-Friendly LP Benchmark via Inverse-KKT Construction

cs.AI · 2026-07-02 · conditional · novelty 7.0

A²utoLPBench is a generator that produces unlimited LP word problems with ground-truth answers known by construction via inverse-KKT, bundled with a Docker environment for agent evaluation.

SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

cs.SE · 2026-06-29 · unverdicted · novelty 7.0

SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.

CHIA: An open-source framework for principled, agentic AI-driven hardware/software co-design research

cs.AR · 2026-06-25 · unverdicted · novelty 7.0 · 2 refs

CHIA introduces a framework for building and deploying agentic AI co-design flows as CHIA loops with tool nodes, reliability mechanisms, and five case-study demonstrations.

Glite ARF: Verifier-Driven Research with Parallel LLM Coding Agents

cs.MA · 2026-06-25 · accept · novelty 7.0

Glite ARF introduces a verifier-driven three-role framework for parallel LLM coding agents, demonstrated by first- and second-place finishes in the BEA 2026 vocabulary-difficulty shared task across three languages with 29.9-35.9% RMSE reduction at ~$450 API cost.

An Empirical Study of LLM-Generated Specifications for VeriFast

cs.SE · 2026-06-25 · unverdicted · novelty 7.0

LLMs preserve functional behavior in over 91% of generated VeriFast specifications and source code but achieve only 31.4% verification success, with 94% of failures due to separation logic domain knowledge errors.

citing papers explorer

Showing 39 of 39 citing papers after filters.

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete
cs.LG · 2026-06-01 · unverdicted · none · ref 55 · internal anchor
Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.

daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization
cs.LG · 2026-06-15 · unverdicted · none · ref 4 · internal anchor
daVinci-kernel is a multi-agent RL system that co-evolves skill selection, policy generation, and summarization via shared LLM and REINFORCE to optimize GPU kernels, reporting higher KernelBench scores than prior RL models.

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
cs.LG · 2026-06-10 · conditional · none · ref 16 · internal anchor
Claw-SWE-Bench is a 350-instance multilingual benchmark for OpenClaw-style agent harnesses that shows adapter design raises Pass@1 from 19.1% to 73.4% on the same model while releasing data for reproducible comparison.

Bittensor Agent Arenas as a Trajectory Primitive: Distilling a Shopping Agent from ShoppingBench Subnet Traces
cs.LG · 2026-06-08 · unverdicted · none · ref 6 · internal anchor
Trajectories from a Bittensor ShoppingBench subnet arena, filtered to retain only agentic tool-calling behavior, enable SFT+GRPO post-training of Qwen3-4B to 42.7% ASR on leak-guarded held-out tests, nearly matching synthetic-data baselines with a fraction of a day's data.

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
cs.LG · 2026-05-17 · unverdicted · none · ref 22 · 2 links · internal anchor
FML-Bench shows a simple greedy hill-climber nearly matches tree search on dense-opportunity tasks while an adaptive agent that broadens search on stagnation outperforms six baselines across 18 tasks.

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
cs.LG · 2026-05-15 · unverdicted · none · ref 16 · internal anchor
AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
cs.LG · 2026-05-13 · unverdicted · none · ref 17 · internal anchor
Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.

Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use
cs.LG · 2026-05-03 · unverdicted · none · ref 33 · internal anchor
The Reward Hacking Benchmark shows RL post-training raises exploit rates in tool-using LLM agents from 0.6% to 13.9%, with environmental hardening cutting exploits by 87.7% relative without lowering task success.

The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
cs.LG · 2026-05-02 · unverdicted · none · ref 15 · internal anchor
An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG · 2026-04-08 · unverdicted · none · ref 55 · internal anchor
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents
cs.LG · 2026-03-13 · unverdicted · none · ref 12 · internal anchor
A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.

Automatic Generation of High-Performance RL Environments
cs.LG · 2026-03-12 · conditional · none · ref 4 · internal anchor
Closed-loop prompt-based translation with hierarchical verification and iterative repair produces equivalent high-performance RL environments across five cases including new TCGJax.

Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization
cs.LG · 2026-07-01 · unverdicted · none · ref 25 · internal anchor
Active-GRPO reaches 0.1773 average SRxSim on TOMG-Bench MOLOPT by adaptively switching between imitation and self-reinforcement while upgrading references, outperforming GRPO and RePO.

K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling
cs.LG · 2026-06-09 · unverdicted · none · ref 63 · internal anchor
K-Forcing introduces progressive self-forcing distillation to train a conditional push-forward model that jointly decodes k future tokens per forward pass, yielding 2.4-3.5x speedup at k=4 with modest quality loss on LM1B and OpenWebText.

Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents
cs.LG · 2026-06-04 · unverdicted · none · ref 10 · internal anchor
ADWM learns a latent diffusion world model with per-transition independent denoising and policy-conditioned guidance to enable accurate offline evaluation of LLM agent policies.

Merge-Bench: Resolve Merge Conflicts with Large Language Models
cs.LG · 2026-05-25 · unverdicted · none · ref 9 · internal anchor
Merge-Bench dataset and LLMergeJ model demonstrate that a 14B-parameter LLM trained with GRPO outperforms some commercial models on Java merge resolution but all tested models resolve under 60% of conflicts across 11 languages.

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale
cs.LG · 2026-05-20 · unverdicted · none · ref 46 · internal anchor
Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.

Harnesses for Inference-Time Alignment over Execution Trajectories
cs.LG · 2026-05-15 · unverdicted · none · ref 25 · internal anchor
Partial harnesses for LLM agents, specifying only initial execution steps, achieve higher pass rates than fully decomposed workflows, as analyzed through trajectory alignment and validated in synthetic and terminal benchmarks.

Revisiting DAgger in the Era of LLM-Agents
cs.LG · 2026-05-13 · conditional · none · ref 19 · internal anchor
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution
cs.LG · 2026-05-08 · unverdicted · none · ref 14 · internal anchor
SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.

Why Does Agentic Safety Fail to Generalize Across Tasks?
cs.LG · 2026-05-07 · conditional · none · ref 52 · internal anchor
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.

QASM-Eval: A Dataset to Train and Evaluate LLMs on OpenQASM-3 Beyond Quantum Circuits
cs.LG · 2026-04-28 · unverdicted · none · ref 19 · internal anchor
Introduces QASM-Eval, the first dataset targeting OpenQASM-3 hardware-facing features for LLM training and evaluation, with an extended verifier for syntax, states, and timelines.

JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
cs.LG · 2026-04-26 · unverdicted · none · ref 22 · internal anchor
JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.

PrismaDV: Automated Task-Aware Data Unit Test Generation
cs.LG · 2026-04-23 · unverdicted · none · ref 34 · internal anchor
PrismaDV generates task-aware data unit tests by jointly analyzing downstream code and dataset profiles, outperforming task-agnostic baselines on new benchmarks spanning 60 tasks, with SIFTA enabling automatic prompt optimization that beats hand-written prompts.

Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search
cs.LG · 2026-03-02 · unverdicted · none · ref 16 · internal anchor
Gome reaches 35.1% any-medal rate on MLE-Bench by mapping reasoning to gradient-based updates, outperforming tree search once models are sufficiently capable.

Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning
cs.LG · 2026-02-06 · unverdicted · none · ref 7 · internal anchor
Group Causal Counterfactual Policy Optimization trains LLMs on generalizable reasoning by defining episodic rewards for counterfactual robustness and transferability then optimizing the policy with token-level advantages.

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
cs.LG · 2025-11-26 · unverdicted · none · ref 27 · internal anchor
Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
cs.LG · 2025-09-03 · unverdicted · none · ref 13 · internal anchor
PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
cs.LG · 2024-07-31 · unverdicted · none · ref 35 · internal anchor
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters
cs.LG · 2026-06-01 · unverdicted · none · ref 12 · internal anchor
PEFT adapters are positioned as persistent personal state on foundation models, organized via Scale Up, Scale Down, and Scale Out axes, with MinT as an infrastructure example for managing them.

StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
cs.LG · 2026-05-04 · accept · none · ref 13 · internal anchor
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models
cs.LG · 2026-03-31 · unverdicted · none · ref 4 · internal anchor
Language models display model-specific escalation thresholds in uncertain decisions that are not explained by scale or architecture, and supervised fine-tuning on explicit uncertainty reasoning produces robust, generalizable policies.

GLM-5: from Vibe Coding to Agentic Engineering
cs.LG · 2026-02-17 · unverdicted · none · ref 19 · internal anchor
GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
cs.LG · 2026-02-08 · unverdicted · none · ref 7 · internal anchor
AceGRPO trains 30B-parameter LLM agents to achieve 100% valid submissions and competitive performance on MLE-Bench-Lite through evolving data buffers and adaptive task sampling.

Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism
cs.LG · 2025-10-30 · unverdicted · none · ref 16 · internal anchor
Nirvana adds a task-aware memory trigger and updater to specialized generalist models, achieving strong general benchmark results, lowest perplexity in biomedicine/finance/law, and improved MRI reconstruction fidelity.

Humanity's Last Exam
cs.LG · 2025-01-24 · unverdicted · none · ref 28 · internal anchor
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

The Rollout Infrastructure Tax in Coding-Agent Reinforcement Learning
cs.LG · 2026-07-01 · unverdicted · none · ref 11 · internal anchor
Comparative study of four execution substrates for coding-agent RL rollouts finds 110x cold-start latency variation and 1.8x spread in worker-hours for one million 150-step trajectories.

LLMOrbit: A Circular Taxonomy of Large Language Models - From Scaling Walls to Agentic AI Systems
cs.LG · 2026-01-20 · unverdicted · none · ref 80 · internal anchor
A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.

HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench
cs.LG · 2026-01-28 · unreviewed · ref 11 · internal anchor
