pith. machine review for the scientific record.

arxiv: 2310.06770 · v3 · submitted 2023-10-10 · 💻 cs.CL · cs.AI · cs.SE

Recognition: 2 theorem links · Lean Theorem

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Alexander Wettig, Carlos E. Jimenez, John Yang, Karthik Narasimhan, Kexin Pei, Ofir Press, Shunyu Yao

Pith reviewed 2026-05-10 13:57 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SE
keywords SWE-bench · language models · GitHub issues · software engineering · code editing · benchmark · Python repositories

The pith

Language models resolve only 1.96% of real GitHub software issues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SWE-bench, a collection of 2,294 real GitHub issues drawn from 12 popular Python repositories, as a test for whether language models can edit large codebases to fix reported problems. Models receive the full codebase and the issue description but no extra tools, and must produce the necessary code changes. Even the strongest evaluated model succeeds on fewer than 2 percent of cases, showing that current systems handle only the simplest fixes while most issues require coordinated edits across multiple files and functions. This setup matters because it moves evaluation beyond isolated code snippets toward the sustained, context-heavy work that defines actual software engineering.

Core claim

We introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues.

What carries the argument

SWE-bench framework that supplies full repository codebases plus issue descriptions and requires models to generate coordinated edits that resolve the reported problem.

If this is right

  • Resolving most issues requires simultaneous understanding of multiple functions, classes, and files, which exceeds the reach of today's models.
  • Models must improve at long-context processing and interaction with execution environments to make headway on this benchmark.
  • Fine-tuning on repository-level data, as attempted with SWE-Llama, produces measurable but still small gains.
  • Progress measured on SWE-bench would mark steps toward language models that function as practical, autonomous software engineers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the low success rate persists, scaling model size alone is unlikely to close the gap without new mechanisms for navigating and editing large codebases.
  • The benchmark could be extended to other languages or to tasks that include running tests and iterating on failures.
  • Model training pipelines might benefit from including more examples of full-repository navigation and multi-file refactoring.

Load-bearing premise

The 2,294 curated issues and the task setup that supplies only the codebase and issue text give an unbiased picture of real-world software engineering difficulty.

What would settle it

A model that resolves substantially more than 2 percent of the SWE-bench issues under the exact same input conditions would indicate that current models are not limited to the simplest cases.

read the original abstract

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SWE-bench, a benchmark of 2,294 software engineering problems drawn from real GitHub issues and pull requests across 12 popular Python repositories. Models receive the full codebase plus an issue description and must output a patch to resolve it; evaluations show that state-of-the-art models solve only the simplest issues, with the best performer (Claude 2) succeeding on just 1.96% of cases. The work positions this as a challenging, realistic testbed for LM capabilities in multi-file reasoning and code editing.

Significance. If the benchmark and results hold, the paper supplies a sustainable, real-world-derived evaluation framework that exposes clear limitations in current LMs for practical software engineering, beyond synthetic code-generation tasks. The scale of the curated dataset and the direct measurement of patch-generation success constitute a concrete advance that can guide development of more autonomous, tool-using models.

major comments (2)
  1. [Task formulation / evaluation protocol] Task formulation (abstract and evaluation protocol): the headline claim that Claude 2 solves only 1.96% of real-world issues rests on a single-shot, full-repo-plus-description setup that forbids search, test execution, feedback loops, or external tools. Real GitHub workflows routinely employ these operations; the protocol therefore introduces an artificial constraint whose effect on measured performance is not quantified, weakening the inference that current LMs are fundamentally limited rather than simply mismatched to the chosen interface.
  2. [Dataset construction] Data curation and leakage analysis (dataset construction section): the manuscript provides no explicit description of filtering criteria used to select the 2,294 issues, no audit for train-test leakage with the evaluated models' pre-training data, and no statistical significance tests around the 1.96% figure. These omissions are load-bearing for the central empirical claim that models 'can resolve only the simplest issues.'

minor comments (2)
  1. [Abstract] The abstract mentions the fine-tuned SWE-Llama model but supplies no training details, hyper-parameters, or comparative numbers; these should be added to the main text or a dedicated subsection.
  2. [Figures/Tables] Figure and table captions should explicitly state the exact prompting template and output format used for each model to allow reproduction.
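The leakage audit requested in major comment 2 usually starts with a date filter: discard any issue created before a model's training-data cutoff, so its gold patch cannot have appeared in the pre-training corpus. A minimal sketch, in which the `created_at` field, the issue records, and the cutoff date are all hypothetical:

```python
from datetime import date

def post_cutoff(issues, cutoff: date):
    """Keep only issues created after a model's training-data cutoff,
    so their gold patches cannot sit in its pre-training corpus.
    Field names and the cutoff value here are illustrative."""
    return [i for i in issues if i["created_at"] > cutoff]

issues = [
    {"id": 1, "created_at": date(2022, 11, 3)},  # pre-cutoff: excluded
    {"id": 2, "created_at": date(2023, 6, 9)},   # post-cutoff: kept
]
recent = post_cutoff(issues, cutoff=date(2023, 1, 1))
```

For proprietary models the cutoff itself may be undisclosed, which is why a date filter bounds, but cannot fully rule out, contamination.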

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments point by point below, providing clarifications and indicating where revisions will be made to strengthen the paper.

read point-by-point responses
  1. Referee: [Task formulation / evaluation protocol] Task formulation (abstract and evaluation protocol): the headline claim that Claude 2 solves only 1.96% of real-world issues rests on a single-shot, full-repo-plus-description setup that forbids search, test execution, feedback loops, or external tools. Real GitHub workflows routinely employ these operations; the protocol therefore introduces an artificial constraint whose effect on measured performance is not quantified, weakening the inference that current LMs are fundamentally limited rather than simply mismatched to the chosen interface.

    Authors: We appreciate the referee's observation regarding the evaluation protocol. The single-shot, full-context setup was intentionally chosen to measure a model's ability to perform end-to-end reasoning and patch generation over an entire codebase given only an issue description, without relying on external tools or iterative feedback. This isolates the core challenge of multi-file code understanding and editing, which remains a prerequisite even for more advanced agentic systems. We do not claim this protocol fully replicates real-world developer workflows; rather, it establishes a challenging baseline that highlights limitations in current models' direct capabilities. We agree that the effect of adding search, execution, or feedback loops is not quantified here and would require a separate experimental design. We will revise the manuscript to more explicitly articulate the scope and motivation of the protocol, including its relation to real GitHub practices and potential extensions with tool use.

    revision: partial

  2. Referee: [Dataset construction] Data curation and leakage analysis (dataset construction section): the manuscript provides no explicit description of filtering criteria used to select the 2,294 issues, no audit for train-test leakage with the evaluated models' pre-training data, and no statistical significance tests around the 1.96% figure. These omissions are load-bearing for the central empirical claim that models 'can resolve only the simplest issues.'

    Authors: We thank the referee for noting these gaps in the dataset section. The curation process selected resolved GitHub issues paired with pull requests from 12 popular Python repositories, applying filters to ensure the issues involved meaningful code changes, were reproducible, and required edits across the repository. We will expand the manuscript with a detailed description of these criteria, including repository selection, issue filtering steps, and validation procedures. For train-test leakage, a complete audit is not possible for proprietary models such as Claude 2 due to undisclosed training data; we will add a limitations discussion noting the use of post-cutoff issues where feasible and the inherent constraints. Regarding statistical significance, we will include confidence intervals or binomial proportion tests around the reported success rates to better support the empirical claims. These changes will be incorporated in the revised version.

    revision: yes
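The binomial interval the rebuttal promises needs no extra dependencies. A minimal sketch using the Wilson score interval; note that the 45-of-2,294 split is inferred from the reported 1.96% rate (1.96% × 2,294 ≈ 45), not a count quoted in the paper:

```python
import math

def wilson_ci(successes: int, trials: int, z: float = 1.96):
    """Two-sided Wilson score interval for a binomial proportion
    (z = 1.96 gives an approximate 95% interval)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials
                                   + z * z / (4 * trials * trials))
    return center - half, center + half

# Roughly 45 resolved issues corresponds to the reported 1.96% of 2,294
# (an inference from the published rate, not a figure from the paper).
lo, hi = wilson_ci(45, 2294)
```

For 45 of 2,294 this evaluates to roughly (1.5%, 2.6%), so the headline 1.96% carries about half a percentage point of sampling uncertainty in each direction.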

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on new benchmark

full rationale

The paper constructs SWE-bench from real GitHub issues and reports measured success rates (e.g., Claude 2 at 1.96%) via direct evaluation. No mathematical derivations, fitted parameters, predictions, or self-citations are used to derive results from inputs; performance numbers are obtained by running models on the introduced dataset without reduction to prior quantities or self-referential definitions. The central claim rests on fresh data collection and standard benchmarking, not on any load-bearing chain that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark introduction paper with no mathematical derivations, fitted constants, or theoretical postulates. No free parameters, axioms, or invented entities are required or introduced beyond the benchmark dataset itself.

pith-pipeline@v0.9.0 · 5540 in / 1253 out tokens · 44626 ms · 2026-05-10T13:57:15.554594+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

    cs.CL 2026-05 unverdicted novelty 8.0

    Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

  2. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

    cs.CL 2026-05 unverdicted novelty 8.0

    A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

  3. PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

    cs.AI 2026-05 unverdicted novelty 8.0

    PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.

  4. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  5. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 accept novelty 8.0

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  6. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  7. StabilizerBench: A Benchmark for AI-Assisted Quantum Error Correction Circuit Synthesis

    quant-ph 2026-04 conditional novelty 8.0

    StabilizerBench is a new benchmark for evaluating AI agents on generating, optimizing, and making fault-tolerant stabilizer circuits for quantum error correction, with efficient verification and multi-tier scoring.

  8. neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

    cs.CV 2026-04 unverdicted novelty 8.0

    neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

  9. HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

    cs.AI 2026-04 unverdicted novelty 8.0

    HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.

  10. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    cs.AI 2024-08 unverdicted novelty 8.0

    The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

  11. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  12. Harnessing Agentic Evolution

    cs.AI 2026-05 unverdicted novelty 7.0

    AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

  13. AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

    cs.SE 2026-05 conditional novelty 7.0

    10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.

  14. State-Centric Decision Process

    cs.AI 2026-05 unverdicted novelty 7.0

    SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

  15. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

    cs.AI 2026-05 conditional novelty 7.0

    BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

  16. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  17. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  18. CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ Commits

    cs.SE 2026-05 accept novelty 7.0

    CppPerf-Mine produces CppPerf-DB, a benchmark of 347 real-world performance-improving C++ patches (39% multi-file) from 42 repositories to evaluate repository-level repair tools.

  19. TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over str...

  20. LLM Agents Already Know When to Call Tools -- Even Without Reasoning

    cs.CL 2026-05 conditional novelty 7.0

    LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.

  21. BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

    cs.AI 2026-05 unverdicted novelty 7.0

    BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.

  22. Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

    cs.SE 2026-05 conditional novelty 7.0

    SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round rep...

  23. AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.

  24. AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    AgentEscapeBench shows LLM agents' success rates drop from 90% to 60% as tool-dependency depth increases from 5 to 25 steps, while humans drop only from 98% to 80%.

  25. CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

    cs.CR 2026-05 unverdicted novelty 7.0

    LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.

  26. Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.

  27. Switchcraft: AI Model Router for Agentic Tool Calling

    cs.AI 2026-05 unverdicted novelty 7.0

    Switchcraft routes agentic tool-calling queries to the lowest-cost model that preserves correctness, reaching 82.9% accuracy and 84% cost reduction on five benchmarks.

  28. MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    MOSAIC-Bench demonstrates that nine production coding agents achieve 53-86% end-to-end attack success rates on staged innocuous tickets across 10 web substrates and 31 CWE classes, far higher than the 0-20.4% rates se...

  29. ProgramBench: Can Language Models Rebuild Programs From Scratch?

    cs.SE 2026-05 unverdicted novelty 7.0

    ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...

  30. ARISE: A Repository-level Graph Representation and Toolset for Agentic Fault Localization and Program Repair

    cs.SE 2026-05 unverdicted novelty 7.0

    ARISE adds a data-flow-augmented repository graph and three-tier tool API to LLM agents, raising Function Recall@1 by 17 points, Line Recall@1 by 15 points, and Pass@1 repair rate to 22% on SWE-bench Lite.

  31. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    cs.AI 2026-05 accept novelty 7.0

    NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...

  32. Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use

    cs.LG 2026-05 unverdicted novelty 7.0

    The Reward Hacking Benchmark shows RL post-training raises exploit rates in tool-using LLM agents from 0.6% to 13.9%, with environmental hardening cutting exploits by 87.7% relative without lowering task success.

  33. The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

    cs.LG 2026-05 unverdicted novelty 7.0

    An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.

  34. CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

    cs.DC 2026-04 unverdicted novelty 7.0

    CacheFlow cuts TTFT by 10-62% in batched LLM serving via 3D-parallel KV cache restoration and a two-pointer scheduler that overlaps recompute and I/O.

  35. Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis

    cs.SE 2026-04 unverdicted novelty 7.0

    ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.

  36. ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

    cs.CV 2026-04 unverdicted novelty 7.0

    ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.

  37. Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation

    cs.SE 2026-04 conditional novelty 7.0

    Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.

  38. Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture the Flag Challenges

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM agents reach only 35% average checkpoint completion on ten realistic CTF challenges in a new open benchmark with automated partial-credit scoring.

  39. DebugRepair: Enhancing LLM-Based Automated Program Repair via Self-Directed Debugging

    cs.SE 2026-04 unverdicted novelty 7.0

    DebugRepair improves LLM-based automated program repair by adding test semantic purification, simulated instrumentation, and debugging-driven conversational repair, fixing 224 Defects4J bugs with GPT-3.5 (26.2% above ...

  40. Choose Your Own Adventure: Non-Linear AI-Assisted Programming with EvoGraph

    cs.HC 2026-04 unverdicted novelty 7.0

    EvoGraph turns linear AI-assisted programming into a manipulable graph of branching histories, reducing cognitive load and enabling better iteration according to a user study with 20 developers.

  41. Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

  42. SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.

  43. Feedback-Driven Execution for LLM-Based Binary Analysis

    cs.CR 2026-04 unverdicted novelty 7.0

    FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precis...

  44. RealVuln: Benchmarking Rule-Based, General-Purpose LLM, and Security-Specialized Scanners on Real-World Code

    cs.CR 2026-04 unverdicted novelty 7.0

    RealVuln benchmark finds security-specialized scanners outperform general-purpose LLMs and rule-based SAST tools on hand-labeled vulnerable Python code under F3 scoring, with all artifacts released.

  45. CodeComp: Structural KV Cache Compression for Agentic Coding

    cs.CL 2026-04 unverdicted novelty 7.0

    CodeComp uses Joern-extracted Code Property Graph priors for training-free structural KV cache compression, outperforming attention-only baselines on bug localization and code generation while matching full-context pa...

  46. FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.

  47. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  48. FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

    cs.CL 2026-04 unverdicted novelty 7.0

    FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

  49. Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding

    cs.SE 2026-04 accept novelty 7.0

    SADU benchmark shows top VLMs reach only 70% accuracy on software architecture diagram tasks, revealing gaps in visual reasoning for engineering artifacts.

  50. Toward Executable Repository-Level Code Generation via Environment Alignment

    cs.SE 2026-04 unverdicted novelty 7.0

    EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative alignment.

  51. AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.

  52. Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

    cs.CL 2026-04 unverdicted novelty 7.0

    LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.

  53. MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration

    cond-mat.mtrl-sci 2026-04 conditional novelty 7.0

    MatClaw is a code-first LLM agent that autonomously executes end-to-end materials workflows by generating and running Python scripts on remote clusters, achieving reliable code generation via memory architecture and R...

  54. AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits

    cs.SE 2026-04 conditional novelty 7.0

    AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.

  55. REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage

    cs.SE 2026-04 unverdicted novelty 7.0

    REAP automatically curates production-derived benchmarks for AI coding agents via LLM classification and stability checks, producing the Harvest benchmark with model solve rates of 42.9-58.2%.

56. τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment

    cs.AI 2025-06 unverdicted novelty 7.0

    τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.

57. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    cs.AI 2024-06 unverdicted novelty 7.0

    τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.

  58. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

    cs.SE 2023-05 accept novelty 7.0

    EvalPlus augments HumanEval with 80x more tests via LLM and mutation strategies, exposing up to 28.9% more incorrect LLM-generated code and reversing some model performance rankings.

  59. SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

    cs.SE 2026-05 unverdicted novelty 6.0

The SWE-Cycle benchmark shows sharp drops in code-agent success rates when moving from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependencies.

  60. MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...

Reference graph

Works this paper leans on

137 extracted references · 137 canonical work pages · cited by 150 Pith papers · 2 internal anchors

  1. [1] The cathedral and the bazaar. Knowledge, Technology & Policy.

  2. [2] Large sequence models for software development activities.

  3. [5] DeepFix: Fixing common C language errors by deep learning. Proceedings of the AAAI Conference on Artificial Intelligence.

  4. [6] Precise learn-to-rank fault localization using dynamic and static features of target programs. ACM Transactions on Software Engineering and Methodology (TOSEM), 2019.

  5. [9] AgentBench: Evaluating LLMs as Agents. 2023.

  6. [10] Mapping global dynamics of benchmark creation and saturation in artificial intelligence. Nature Communications.

  7. [11] Dynabench: Rethinking Benchmarking in NLP. 2021.

  8. [12] OctoPack: Instruction Tuning Code Large Language Models. 2023.

  9. [13] Evaluating Large Language Models Trained on Code. 2021.

  10. [14] DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation. 2022.

  11. [15] Program Synthesis with Large Language Models. 2021.

  12. [16] Measuring Coding Challenge Competence With APPS. 2021.

  13. [17] InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback. 2023.

  14. [19] CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. 2023.

  15. [20] MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation. 2022.

  16. [21] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. 2022.

  17. [22] Automating Code Review Activities by Large-Scale Pre-training. 2022.

  18. [23] CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model. 2021.

  19. [24] CommitBART: A Large Pre-trained Model for GitHub Commits. 2023.

  20. [26] Large Language Models Meet NL2Code: A Survey. 2023.

  21. [27] Yang, Xin; Kula, Raula Gaikovina; Yoshida, Norihiro; Iida, Hajimu. Proceedings of the 13th International Conference on Mining Software Repositories.

  22. [28] Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions. 2023.

  23. [29] Program Repair. 2022.

  24. [30] Gazzola, Luca; Micucci, Daniela; Mariani, Leonardo. Automatic Software Repair: A Survey.

  25. [31] Automated program repair. Communications of the ACM, 2019.

  26. [33] Less training, more repairing please: revisiting automated program repair via zero-shot learning. Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering.

  27. [34] Conversational Automated Program Repair. 2023.

  28. [35] Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction. 2023.

  29. [36] Better Automatic Program Repair by Using Bug Reports and Tests Together. 2023.

  30. [37] Automated Repair of Programs from Large Language Models. 2023.

  31. [38] Natural Language to Code Generation in Interactive Data Science Notebooks. 2022.

  32. [39] ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation. 2023.

  33. [40] On Multi-Modal Learning of Editing Source Code. 2021.

  34. [41] CoditT5: Pretraining for Source Code and Natural Language Editing. 2022.

  35. [42] Towards Automating Code Review Activities. 2021.

  36. [44] AI Safety Subproblems for Software Engineering Researchers. 2023.

  37. [45] Large Language Models for Software Engineering: A Systematic Literature Review. 2023.

  38. [46] Ren. ISSTA 2014, Proceedings of the 2014 International Symposium on Software Testing and Analysis. 2014.

  39. [47] FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems.

  40. [48] Li, Shenggui; Xue, Fuzhao; Baranwal, Chaitanya; Li, Yongbin; You, Yang. Sequence Parallelism: Long Sequence Training from System Perspective. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. doi:10.18653/v1/2023.acl-long.134

  41. [49] Software Testing with Large Language Model: Survey, Landscape, and Vision. 2023.

  42. [50] InCoder: A Generative Model for Code Infilling and Synthesis. 2023.

  43. [51] Shuai Lu; Daya Guo; Shuo Ren; Junjie Huang; Alexey Svyatkovskiy; Ambrosio Blanco; et al. CoRR.

  44. [52] WebArena: A Realistic Web Environment for Building Autonomous Agents. 2023.

  45. [53] Mind2Web: Towards a Generalist Agent for the Web. 2023.

  46. [54] Holistic Evaluation of Language Models. 2022.

  47. [55] Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. 2023.

  48. [56] The probabilistic relevance framework: BM25 and beyond. Foundations and Trends, 2009.

  49. [57] Research community dynamics behind popular AI benchmarks. Nature Machine Intelligence.

  50. [58] What Will it Take to Fix Benchmarking in Natural Language Understanding? 2021.

  51. [59] Language Tasks and Language Games: On Methodology in Current Natural Language Processing Research. 2019.

  52. [61] Measuring Coding Challenge Competence With APPS. NeurIPS.

  53. [62] Program Synthesis with Large Language Models. 2021.

  54. [63] CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation. 2022.

  55. [64] CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. 2023.

  56. [65] Measuring The Impact Of Programming Language Distribution. 2023.

  57. [66] Multi-lingual Evaluation of Code Generation Models. 2023.

  58. [67] How Often Do Single-Statement Bugs Occur? The ManySStuBs4J Dataset. 2020 IEEE/ACM 17th International Conference on Mining Software Repositories (MSR).

  59. [68] Code Llama: Open Foundation Models for Code. 2023.

  60. [69] Hu, Edward J.; Shen, Yelong; Wallis, Phillip; Allen-Zhu, Zeyuan; Li, Yuanzhi; Wang, Shean; Wang, Lu; Chen, Weizhu. 2022.

  61. [70] DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. 2023.

  62. [71] Nelson F. Liu; Kevin Lin; John Hewitt; Ashwin Paranjape; Michele Bevilacqua; Fabio Petroni; Percy Liang.

  63. [72] Gururangan, Suchin; Swayamdipta, Swabha; Levy, Omer; Schwartz, Roy; Bowman, Samuel; Smith, Noah A. Annotation Artifacts in Natural Language Inference Data. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018.

  64. [73] The Curious Case of Neural Text Degeneration. International Conference on Learning Representations.

  65. [74] An Analysis of the Automatic Bug Fixing Performance of ChatGPT. 2023.

  66. [75] Shaping program repair space with existing patches and similar code. Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis.

  67. [76] McCabe, Thomas J. A Complexity Measure.

  68. [77] Halstead, Maurice H. 1977.

  69. [78] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs with graphs. arXiv preprint arXiv:1711.00740, 2017.

  70. [79] Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, et al. Multi-lingual evaluation of code generation models. 2023.

  71. [80] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. 2021.

  72. [81] Samuel R. Bowman and George E. Dahl. What will it take to fix benchmarking in natural language understanding? 2021.

  73. [82] Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. MultiPL-E: A scalable and extensible approach to benchmarking neural code generation. 2022.

  74. [83] Saikat Chakraborty and Baishakhi Ray. On multi-modal learning of editing source code. 2021.

  75. [84] Saikat Chakraborty, Yujian Li, Matt Irvine, Ripon Saha, and Baishakhi Ray. Entropy guided spectrum based bug localization using statistical language model. arXiv preprint arXiv:1802.06947, 2018.

  76. [85] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, et al. Evaluating large language models trained on code. 2021.

  77. [86] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344-16359, 2022.

  78. [87] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. 2023.

  79. [88] Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, and George Karypis. Large language models of code fail at completing code with potential bugs. arXiv preprint arXiv:2306.03438, 2023.

  80. [89] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. ClassEval: A manually-crafted benchmark for evaluating LLMs on class-level code generation. 2023.

Showing first 80 references.