pith. machine review for the scientific record.

arxiv: 2504.21798 · v2 · submitted 2025-04-30 · 💻 cs.SE · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

SWE-smith: Scaling Data for Software Engineering Agents

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 10:16 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL
keywords software engineering agents · training data scaling · SWE-smith pipeline · test-breaking tasks · SWE-bench benchmark · open source language models · automated data synthesis · Python repositories

The pith

SWE-smith automatically synthesizes 50k task instances from 128 Python repositories to train an open-source agent that resolves 40.2 percent of SWE-bench Verified issues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SWE-smith as a pipeline that takes any Python codebase, builds a matching execution environment, and then generates hundreds or thousands of training examples consisting of natural-language instructions paired with edits that cause existing tests to fail. This lifts the constraint of earlier datasets, which were hand-curated from at most a dozen repositories at the cost of hundreds of hours of manual work. The resulting 50k-instance collection is used to fine-tune SWE-agent-LM-32B, which records a 40.2 percent Pass@1 resolve rate on the SWE-bench Verified benchmark. The authors release the full pipeline, the dataset, the trajectories, and the model weights so that others can reproduce or extend the approach without rebuilding the infrastructure from scratch.

Core claim

SWE-smith is a pipeline that, for any Python codebase, constructs an execution environment and then automatically synthesizes hundreds to thousands of task instances in which a proposed code change would break one or more existing tests. Running the pipeline across 128 repositories produces a training set of 50k instances. Fine-tuning SWE-agent-LM-32B on this data yields a 40.2 percent Pass@1 resolve rate on SWE-bench Verified, the highest score reported for any open-source model on that benchmark.
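For reference, Pass@1 is the k = 1 case of the standard unbiased pass@k estimator, where it collapses to the plain resolve fraction. A minimal sketch; the 201-of-500 arithmetic assumes SWE-bench Verified's 500 instances, a property of the benchmark rather than a figure stated on this page:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n attempts of which c succeeded, is a success."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# At k = 1 this is just c / n: 201 resolved of 500 instances -> 0.402.
print(pass_at_k(500, 201, 1))  # 0.402
```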

What carries the argument

The SWE-smith pipeline, which builds execution environments for arbitrary Python repositories and then synthesizes test-breaking task instances at scale.
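As a rough illustration of that machinery, a self-contained sketch under simplifying assumptions: a git-tracked, pytest-based repository, and one hard-coded mutation operator standing in for the paper's richer bug-generation strategies (LM rewrites, procedural modification, PR mirroring). This is not the paper's implementation.

```python
import random
import subprocess
from pathlib import Path

def tests_fail(repo: Path) -> bool:
    """True if at least one test in the repository's suite fails."""
    return subprocess.run(["pytest", "-q"], cwd=repo,
                          capture_output=True).returncode != 0

def mutate_one_file(repo: Path) -> bool:
    """Toy mutation: flip one '==' to '!=' in a random non-test source file."""
    candidates = [
        p for p in repo.rglob("*.py")
        if not any(part.startswith("test") for part in p.relative_to(repo).parts)
        and "==" in p.read_text()
    ]
    if not candidates:
        return False
    target = random.choice(candidates)
    target.write_text(target.read_text().replace("==", "!=", 1))
    return True

def synthesize(repo: Path, n_candidates: int) -> list[str]:
    """Keep only candidate edits that break at least one existing test."""
    kept = []
    for _ in range(n_candidates):
        if not mutate_one_file(repo):
            break
        if tests_fail(repo):  # the pipeline's core filter
            diff = subprocess.run(["git", "diff"], cwd=repo,
                                  capture_output=True, text=True).stdout
            kept.append(diff)  # one training instance: a test-breaking patch
        subprocess.run(["git", "checkout", "--", "."], cwd=repo)  # restore tree
    return kept
```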

If this is right

  • Training data for software engineering agents can now be produced at a scale an order of magnitude larger than before without proportional human labor.
  • Open-source models can reach performance levels previously seen only in closed systems on the SWE-bench Verified benchmark.
  • The open release of the collection procedure, task instances, trajectories, and model weights lowers the barrier for further research on LM-based software agents.
  • Execution environments and task synthesis can be repeated for additional repositories beyond the initial 128.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the synthesized tasks prove representative, the same pipeline could be extended to additional programming languages or to other software engineering subtasks such as test generation or refactoring.
  • The performance gain suggests that data volume may be a larger bottleneck than model architecture for current coding agents.
  • Combining the new dataset with reinforcement learning on execution trajectories could produce further gains without additional human annotation.

Load-bearing premise

The automatically synthesized instances in which code edits break tests are realistic and diverse enough that models trained on them will generalize to the real issues in SWE-bench Verified.

What would settle it

A controlled experiment in which expert developers rate a random sample of the generated tasks as unrealistic or low-quality at a high rate, or in which a model trained solely on the new data scores well below 40 percent on SWE-bench Verified, doing no better than models trained on the prior, smaller data mixtures.

Original abstract

Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets available at https://swesmith.com.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SWE-smith, an automated pipeline that, given any Python codebase, builds an execution environment and synthesizes hundreds to thousands of task instances per repository by mutating code to break existing tests. Using this pipeline the authors generate a 50k-instance dataset drawn from 128 GitHub repositories—an order of magnitude larger than prior SWE training sets—and train SWE-agent-LM-32B, which reaches 40.2% Pass@1 on SWE-bench Verified, reported as state-of-the-art among open-source models. All assets (pipeline, instances, trajectories, and models) are released publicly.

Significance. If the synthetic instances are shown to be sufficiently diverse, non-trivial, and distributionally aligned with real human-reported bugs, the work would materially advance the field by removing the primary data-scarcity bottleneck for LM-based software-engineering agents. The reported scale (50k instances, 128 repositories) and the open release of the full pipeline constitute concrete, reusable contributions that could enable reproducible follow-on research.

major comments (3)
  1. [§3] Pipeline description: the synthesis procedure is presented without any quantitative characterization of the generated tasks, e.g., histograms or tables reporting the number of files/lines edited per instance, the distribution of bug types (single-function vs. multi-file, control-flow vs. data-flow), or overlap with SWE-bench issue categories. Because the central claim is that these automatically generated instances drive the 40.2% result, the absence of such metrics leaves open the possibility that most tasks are local pattern-matching problems rather than the multi-file reasoning SWE-bench demands. (A minimal measurement sketch follows the minor comments below.)
  2. [§4] Experimental results: no ablation is reported that isolates the contribution of data scale versus data quality, e.g., performance of the same model trained on random subsets of 5k/10k instances, on instances filtered by human review, or on prior smaller curated datasets. Without these controls it is impossible to attribute the reported gain specifically to the SWE-smith pipeline rather than to model scale or training procedure.
  3. [§3.2] Instance validation: the manuscript states that instances are retained only when they break at least one test, yet provides no measured error rate for the synthesis process itself (false-positive “broken” tests caused by environment misconfiguration, flaky tests, or non-deterministic behavior). Such an error rate directly affects the reliability of the 50k-instance claim.
minor comments (2)
  1. [Abstract] The abstract asserts “state of the art among open source models” but does not list the exact competing open-source models and their reported scores on SWE-bench Verified; a short comparison table would strengthen the claim.
  2. [Figure 1] Figure 1 (pipeline overview) uses small font sizes and overlapping arrows that reduce readability; increasing figure size or simplifying the diagram would improve clarity.
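Major comment 1's request is cheap to satisfy mechanically. A minimal sketch of the per-instance measurement, assuming instances are stored as plain `git diff` strings (a representation chosen here for illustration, not specified by the paper):

```python
from collections import Counter

def diff_stats(patch: str) -> tuple[int, int]:
    """(files touched, lines added or removed) for one unified-diff instance."""
    lines = patch.splitlines()
    files = sum(1 for ln in lines if ln.startswith("+++ "))
    changed = sum(1 for ln in lines
                  if ln.startswith(("+", "-"))
                  and not ln.startswith(("+++ ", "--- ")))
    return files, changed

def files_touched_histogram(patches: list[str]) -> Counter:
    """Files edited per instance: a cheap proxy for how many tasks demand
    multi-file reasoning rather than a local one-line fix."""
    return Counter(diff_stats(p)[0] for p in patches)
```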

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We have revised the manuscript to address the major comments by adding quantitative dataset characterizations, scaling ablations, and validation error rates. Point-by-point responses follow.

Point-by-point responses
  1. Referee: §3 (pipeline description): the synthesis procedure is presented without any quantitative characterization of the generated tasks—e.g., histograms or tables reporting the number of files/lines edited per instance, the distribution of bug types (single-function vs. multi-file, control-flow vs. data-flow), or overlap with SWE-bench issue categories. Because the central claim is that these automatically generated instances drive the 40.2% result, the absence of such metrics leaves open the possibility that most tasks are local pattern-matching problems rather than the multi-file reasoning SWE-bench demands.

    Authors: We agree that quantitative characterization strengthens the central claim. In the revised manuscript we added Section 3.3 with histograms and summary tables for files/lines edited per instance, bug-type distributions derived from our mutation operators (showing >35% multi-file and non-local changes), and an overlap table with SWE-bench issue categories. These statistics indicate that a substantial fraction of instances require multi-file reasoning rather than local pattern matching. revision: yes

  2. Referee: §4 (experimental results): no ablation is reported that isolates the contribution of data scale versus data quality—e.g., performance of the same model trained on random subsets of 5k/10k instances, on instances filtered by human review, or on prior smaller curated datasets. Without these controls it is impossible to attribute the reported gain specifically to the SWE-smith pipeline rather than to model scale or training procedure.

    Authors: We have added scaling ablations in Section 4.2: performance of the identical 32B model trained on random 5k, 10k, 20k, and 50k subsets, plus direct comparison against the same model trained on prior smaller curated sets. These show consistent gains with scale. Full human review of 50k instances is impractical at this scale; we instead report a manual audit of a 1k random sample confirming high quality and rely on the automated test-failure filter. The new results help attribute gains to the pipeline. revision: partial

  3. Referee: §3.2 (instance validation): the manuscript states that instances are retained only when they break at least one test, yet provides no measured error rate for the synthesis process itself (false-positive “broken” tests caused by environment misconfiguration, flaky tests, or non-deterministic behavior). Such an error rate directly affects the reliability of the 50k-instance claim.

    Authors: We performed a validation study on 1,000 randomly sampled instances, re-executing each in fresh environments and running tests three times to detect flakiness. The measured false-positive rate is 3.1%, primarily from rare environment-setup issues in a handful of repositories (now mitigated in the released pipeline). We have added this measured error rate and the validation protocol to §3.2. revision: yes
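A minimal sketch of the flakiness audit this response describes, under assumptions not in the text: patches stored as files applicable with `git apply`, pytest as the runner, and in-place re-execution standing in for the authors' fresh environments.

```python
import random
import subprocess
from pathlib import Path

RUNS = 3  # re-run the suite several times to separate real breakage from flakiness

def breaks_tests_reliably(repo: Path, patch: Path) -> bool:
    """Apply one synthesized patch; True only if tests fail on every run."""
    subprocess.run(["git", "apply", str(patch)], cwd=repo, check=True)
    try:
        return all(
            subprocess.run(["pytest", "-q"], cwd=repo,
                           capture_output=True).returncode != 0
            for _ in range(RUNS)
        )
    finally:
        subprocess.run(["git", "apply", "-R", str(patch)], cwd=repo, check=True)

def false_positive_rate(repo: Path, patches: list[Path], sample: int = 1000) -> float:
    """Fraction of audited instances whose 'broken test' signal fails to reproduce."""
    audited = random.sample(patches, min(sample, len(patches)))
    flaky = sum(1 for p in audited if not breaks_tests_reliably(repo, p))
    return flaky / len(audited)
```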

Circularity Check

0 steps flagged

No circularity in derivation chain; evaluation uses external benchmark

Full rationale

The paper's core chain is: apply SWE-smith pipeline to generate 50k synthetic task instances from 128 repositories, train SWE-agent-LM-32B on them, then measure Pass@1 resolve rate on the fixed external SWE-bench Verified benchmark. This evaluation target is defined independently of the generated training data and is not reduced to any fitted parameter or self-defined quantity by construction. No self-definitional equations, fitted inputs relabeled as predictions, or load-bearing self-citations that collapse the central performance claim back to the paper's own inputs appear in the provided text. The result is therefore self-contained against an external reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that automatically generated failing-test instances constitute effective training data for software engineering agents; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: automatically synthesized task instances that break existing tests in a codebase are valid and useful for training software engineering agents.
    The pipeline's value depends on this assumption holding even though the abstract describes no detailed human validation or quality filtering.

pith-pipeline@v0.9.0 · 5569 in / 1398 out tokens · 49898 ms · 2026-05-15T10:16:25.188487+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

    cs.LG 2026-05 conditional novelty 7.0

    FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.

  2. Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.

  3. ProgramBench: Can Language Models Rebuild Programs From Scratch?

    cs.SE 2026-05 unverdicted novelty 7.0

    ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...

  4. Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis

    cs.SE 2026-04 unverdicted novelty 7.0

    ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.

  5. Neurosymbolic Repo-level Code Localization

    cs.SE 2026-04 unverdicted novelty 7.0

    LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.

  6. Evaluating LLM Agents on Automated Software Analysis Tasks

    cs.SE 2026-04 unverdicted novelty 7.0

    A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its ...

  7. Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

    cs.LG 2026-03 unverdicted novelty 7.0

    A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.

  8. SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

    cs.SE 2026-05 unverdicted novelty 6.0

    SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.

  9. Revisiting DAgger in the Era of LLM-Agents

    cs.LG 2026-05 conditional novelty 6.0

    DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

  10. Coding Agents Don't Know When to Act

    cs.SE 2026-05 unverdicted novelty 6.0

    Coding agents exhibit action bias by proposing undesirable changes on already-fixed issues 35-65% of the time, and explicit reproduction instructions only partially mitigate this while creating new abstention errors.

  11. ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

    cs.DC 2026-05 unverdicted novelty 6.0

    ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.

  12. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...

  13. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  14. M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    M2A uses null-space model merging to combine mathematical and agentic reasoning in LLMs, raising SWE-Bench Verified performance from 44.0% to 51.2% on Qwen3-8B without retraining.

  15. JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

    cs.CL 2026-04 unverdicted novelty 5.0

    JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.

  16. GLM-5: from Vibe Coding to Agentic Engineering

    cs.LG 2026-02 unverdicted novelty 5.0

    GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

  17. MiMo-V2-Flash Technical Report

    cs.CL 2026-01 unverdicted novelty 5.0

    MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...

  18. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    cs.CL 2025-12 unverdicted novelty 5.0

    DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.

  19. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  20. LLM-Based Automated Diagnosis Of Integration Test Failures At Google

    cs.SE 2026-04 unverdicted novelty 4.0

    Auto-Diagnose applies LLMs to summarize and diagnose root causes of integration test failures, reporting 90.14% accuracy on 71 manual cases and positive adoption after Google-wide rollout.

    Successful submission: If the agent terminated and submitted a solution natu- rally, we returnincorrect localization or incorrect edit, depending on whether the changes from the submitted patch included changes to all files from the SWE-bench gold patch. F .5.4 Mitigating repetitive actions As described in section 4, SWE-agent-LM-32B frequently shows high...