SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
9 Pith papers cite this work.
citing papers explorer
- AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
  10.7% of passing SWE-agent trajectories are "Lucky Passes" with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
- ProgramBench: Can Language Models Rebuild Programs From Scratch?
  ProgramBench introduces 200 tasks in which models must reconstruct full programs such as FFmpeg or SQLite from documentation alone; none of the 9 evaluated LMs fully solves any task, and the best passes 95% of tests on only 3% of tasks while favoring monolithic code.
- Revisiting DAgger in the Era of LLM-Agents
  DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
- Coding Agents Don't Know When to Act
  Coding agents exhibit action bias, proposing undesirable changes on already-fixed issues 35-65% of the time; explicit reproduction instructions only partially mitigate this while introducing new abstention errors.
- Reproduction Test Generation for Java SWE Issues
  Introduces the first benchmark for generating reproduction tests from Java repository issues and adapts a prior Python tool, which achieves high performance on it.
- On Problems of Implicit Context Compression for Software Engineering Agents
  In-Context Autoencoder succeeds on single-shot common-knowledge and code tasks but fails on multi-step agentic coding tasks.
- GLM-5: from Vibe Coding to Agentic Engineering
  GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks, achieved via new asynchronous RL methods and cost-saving DSA.
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
  UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, while attaining a 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
- A Brief Overview: Agentic Reinforcement Learning In Large Language Models
  The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities such as meta-reasoning and self-reflection into LLM-based agents.