Dissecting model behavior through agent trajectories

Anoop Deoras; Gaurav Gupta; Jun Huan; Vatshank Chaturvedi

arxiv: 2606.17454 · v2 · pith:OAVPFXTVnew · submitted 2026-06-16 · 💻 cs.AI · cs.LG


Pith reviewed 2026-06-27 01:25 UTC · model grok-4.3


keywords: agent trajectories · code state-spaces · intent-execution gap · model behavior analysis · autonomous problem solving · SSA harness · agent benchmarks · phase transitions


The pith

Representing agent trajectories in code state-spaces reveals model-level differences in problem-solving behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that AI agent performance depends on closing the intent-execution gap: the mismatch between what a model intends and what the harness executes. It introduces a simple, customizable harness called SSA (Simple Strands Agent) that reproduces or improves reported pass@1 results on benchmarks such as SWE-Pro and Terminal-Bench-2 across model families. Analysis of 138k trajectories mapped into code state-spaces shows that models differ in how they allocate effort, as measured by edit frequency, testing activity, and phase transitions. These finer-grained patterns go beyond aggregate success rates and expose distinct autonomous problem-solving styles. The approach matters because it identifies model-specific behaviors that harness design can address.

Core claim

When agent trajectories are represented in code state-spaces, models exhibit observable differences in problem-solving behavior on metrics such as edit frequency, testing activity, and phase transitions. These metrics indicate how individual models allocate effort across the stages of autonomous problem solving, even when pass@1 scores are comparable.

What carries the argument

Code state-space representation of agent trajectories, which encodes sequences of code states to quantify metrics like edit frequency and testing activity.
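As a rough illustration of how such metrics could be computed, the sketch below bins a trajectory by normalized cycle position and reports the share of each activity per bin. The label names (`edit`, `scratch_test`, `suite_test`), the list-of-labels encoding, and both function names are assumptions made for this example; they are not the paper's data format or code.

```python
from collections import Counter

# Hypothetical activity labels, loosely following the three activities
# named in the Figure 5 caption (source editing, scratch testing, suite testing).
ACTIVITIES = ("edit", "scratch_test", "suite_test")


def activity_shares(trajectory, n_bins=4):
    """Return per-bin activity shares over normalized cycle position.

    `trajectory` is a list of activity labels, one per agent cycle
    (an assumed encoding). Labels outside ACTIVITIES are ignored.
    """
    n = len(trajectory)
    bins = [Counter() for _ in range(n_bins)]
    for i, label in enumerate(trajectory):
        pos = i / n  # normalized position in [0, 1)
        bins[min(int(pos * n_bins), n_bins - 1)][label] += 1
    shares = []
    for counts in bins:
        total = sum(counts[a] for a in ACTIVITIES)
        shares.append(
            {a: (counts[a] / total if total else 0.0) for a in ACTIVITIES}
        )
    return shares


def edit_frequency(trajectory):
    """Fraction of cycles that are source edits (0.0 for an empty run)."""
    return trajectory.count("edit") / len(trajectory) if trajectory else 0.0
```

Averaging the per-bin shares over many runs of one model would give a curve of the kind the Figure 5 caption describes, under the stated assumptions.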

Load-bearing premise

The chosen code state-space representation and the SSA harness do not systematically distort the observed differences, so that trajectories reflect model intent rather than harness artifacts.

What would settle it

Finding that different models produce statistically indistinguishable distributions of edit frequencies, testing activities, and phase-transitions when run in the same SSA harness on the same benchmarks.
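One way to probe this condition is a two-sample permutation test on a per-run metric such as edit frequency. The sketch below is a generic test on the difference of means, not a procedure taken from the paper; the function name, the default number of permutations, and the sample values in the usage note are all illustrative.

```python
import random


def permutation_pvalue(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference of sample means.

    `a` and `b` are per-run metric values for two models (assumed inputs).
    Returns an estimated p-value with add-one smoothing.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

For example, `permutation_pvalue([0.10, 0.12, 0.11], [0.30, 0.28, 0.33])` compares made-up edit frequencies for two models. A large p-value across all metrics and model pairs would be the "indistinguishable" outcome described above.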

Figures

Figures reproduced from arXiv: 2606.17454 by Anoop Deoras, Gaurav Gupta, Jun Huan, Vatshank Chaturvedi.

**Figure 1.** The intent-execution gap derails an agent. Each panel pairs the model’s intent (its own reasoning) with the execution (the payload the harness received and the feedback it returned in a closed loop). The model sets out to read a file, but because of parsing issues in the decoded streams, the harness receives an invalid tool name. The harness sends generic but valid feedback: ‘tool name not found’. In …

**Figure 2.** SSA architecture. Tasks and model adapters feed a cyclic SimpleStrandAgent loop. The loop streams model output, dispatches tools, executes them in the environment, and appends tool results to conversation state. Hooks validate bounds and record the loop; termination extracts the final environment state and diff patch. Model interface. Adapters map model-provider APIs into a common event interface: assistan…

**Figure 3.** Reasoning nudges in SSA: a quantitative target for Claude variants and a flexible directive for other model families. 3.3 Aligning harness with the model’s tool use preferences Across agents, while tools function in exactly the same way, models tend to exhibit distinct preferences in how they invoke them. For example, GPT models prefer to update code by using an apply_patch command to splice in text from a…

**Figure 4.** Agent trajectory in code state space. Solution divergence D(t) evolution as agent traverses code state from the given initial state at t1 till it reaches solution space Si at step t4. Each panel compares the reconstructed live diff with the closest element in Si. Green lines are reproduced patch features and gray lines are still missing. The trace follows a staircase D = 1.0 → D ≈ 0.67 → D = 0.5 → D = 0 a…

**Figure 5.** Composition of activities is a model signature. Share of three activities (source editing, scratch testing, suite testing) vs normalized cycle position of agent trajectories between 0 and 1, averaged across full benchmark instances (5 runs per instance). For SWE-Bench-Verified (left), Opus 4.6 performs considerable scratch-testing from the beginning and edits peak in the middle, with heavy suite-testing to…

**Figure 6.** Phase composition surfaces model-specific schedules. Per-cycle share of the four R6 (see LLM judge rubrics in …

**Figure 7.** Solution-distance curves expose different problem-solving styles between strong models. Mean D(t) vs. normalised cycle position on SWE-Bench-Verified, split by resolved (blue) and unresolved (red) trajectories; shaded bands are ±1σ run-to-run variability. Opus 4.6 descends sharply on resolved runs (around t = 0.5) and plateaus high on unresolved ones, cleanly separating progress from failure. Gemini 3.1 Pr…

**Figure 8.** Single-edit collapse. A single correct edit can move an agent directly from the initial state to the solution set. Faithful replay of the GPT-5.4 trajectory on astropy__astropy-12907 (SWE-Bench Verified) has only two relevant states: the initial repository state x(t1) before any edit (D=1) and the post-edit state x(t2) after the decisive edit at cycle position t ≈ 0.53 (D=0). Green tiles mark reference pat…

**Figure 9.** Fixed-reference staircase. Against a fixed reference, solution distance behaves like recall over patch features. The Opus 4.6 trajectory for django__django-11292 is scored against one 8-feature reference held constant across the four panels. Green tiles are reference features reproduced by the live repository state, and gray tiles are missing features. As the agent lands successive hunks, the matched count…

**Figure 10.** Max-over-modes view. The full metric tracks the nearest empirical solution mode, not a single fixed patch. The same Opus 4.6 run on django__django-11292 is re-rendered with the displayed reference chosen independently at each snapshot by the maximisation in Eq. 5. The trajectory descends 1 → 0.667 → 0.5 → 0 between cycles 0.273 and 0.292, then remains at zero through three later refinements. The best refe…

**Figure 11.** Found-it-then-lost-it. In this Gemini 3 Flash trajectory on django__django-14140, the first edit exactly reproduces a popular 12-feature rewrite in S̃i (22 of ∼30 resolved runs, also the dataset gold), so D drops from 1 to 0 at cycle 0.15. The next edit reverts that rewrite and substitutes a multi-condition guard that overlaps only one reference feature, so D jumps back to 0.917 (rounded to 0.9); the up…

**Figure 12.** Revert as discovery. In this Gemini 3.1 Pro run on astropy-12907 (SWE-Bench-Verified), the agent first reaches the gold-mode fix, deliberately reverts it to test the unfixed baseline, and uncovers a pre-existing unrelated bug. Faithful replay records the productive backtrack (D: 1 → 0 → 1 → 0.5 → 0): the final resolved patch combines the original _cstack fix with a new np.roll(..., axis=0) fix.

**Figure 13.** SWE-Bench-Verified: total output tokens vs. pass@1, one point per model. Output tokens …

**Figure 14.** Backtracking ∆D histograms (part 1 of 2): Anthropic Claude and OpenAI families. Each cell is one model, green = progress edits, red = backtracking edits. Annotation shows the per-model backtrack-edit share.

**Figure 15.** Backtracking ∆D histograms (part 2 of 2). Gemini, Grok and Qwen families. Continued from …

**Figure 16.** Edit/test ratio per cycle (part 1 of 2): Anthropic Claude and OpenAI families. Each cell …

**Figure 17.** Edit/test ratio per cycle (part 2 of 2). Gemini, Grok and Qwen families. Continued from …

**Figure 18.** Mean solution-distance curves D(t) (part 1 of 2): Anthropic Claude and OpenAI families. Blue = resolved trajectories, red = unresolved. Shaded bands show ±1σ run-to-run variability (∼5 runs/instance, averaged across instances).

**Figure 19.** Mean solution-distance curves (part 2 of 2): Gemini, Grok and Qwen families. Shaded …

**Figure 20.** Genuine tool-error rate per cycle (part 1 of 2): Anthropic Claude and OpenAI families.

**Figure 21.** Genuine tool-error rate per cycle (part 2 of 2): Gemini, Grok and Qwen families. Continued …

**Figure 22.** Phase composition R6 (see Table 4 for LLM judge rubrics) explore …

**Figure 23.** Phase composition (part 2 of 2): Gemini, Grok and Qwen families. Continued from …

**Figure 24.** Tool-call distribution per model on SWE-Bench-Verified (part 1 of 2): Anthropic Claude …

**Figure 25.** Tool-call distribution per model on SWE-Bench-Verified (part 2 of 2): Gemini, Grok and …

**Figure 26.** Figure 26 reports the same token-vs-pass@1 Pareto view for SWE-Bench-Pro that …

**Figure 27.** Backtracking ∆D histograms on SWE-Bench-Pro (part 1 of 2): Anthropic Claude and OpenAI families. Each cell is one model; green = progress edits, red = backtracking edits. Annotation shows the per-model backtrack-edit share.

**Figure 28.** Backtracking ∆D histograms on SWE-Bench-Pro (part 2 of 2): Gemini, Grok and Qwen families. Continued from …

**Figure 29.** Edit/test ratio per cycle on SWE-Bench-Pro (part 1 of 2): Anthropic Claude and OpenAI …

**Figure 30.** Edit/test ratio per cycle on SWE-Bench-Pro (part 2 of 2): Gemini, Grok and Qwen families.

**Figure 31.** Mean solution-distance curves D(t) on SWE-Bench-Pro (part 1 of 2): Anthropic Claude and OpenAI families. Blue = resolved trajectories, red = unresolved. Shaded bands show ±1σ run-to-run variability (∼5 runs/instance, averaged across instances).

**Figure 32.** Mean solution-distance curves on SWE-Bench-Pro (part 2 of 2): Gemini, Grok and Qwen …

**Figure 33.** Genuine tool-error rate per cycle on SWE-Bench-Pro (part 1 of 2): Anthropic Claude and …

**Figure 34.** Genuine tool-error rate per cycle on SWE-Bench-Pro (part 2 of 2): Gemini, Grok and …

**Figure 35.** Phase composition on SWE-Bench-Pro R6 (see Table 4 for LLM judge rubrics) explore …

**Figure 36.** Phase composition on SWE-Bench-Pro (part 2 of 2): Gemini, Grok and Qwen families.

**Figure 37.** Tool-call distribution per model on SWE-Bench-Pro (part 1 of 2): Anthropic Claude and …

**Figure 38.** Tool-call distribution per model on SWE-Bench-Pro (part 2 of 2): Gemini, Grok and …

**Figure 39.** Terminal-Bench-2: total output tokens vs. pass@1, one point per model. Output tokens are …

**Figure 40.** Phase composition on Terminal-Bench-2 R6 (see Table 4 for LLM judge rubrics) explore …

**Figure 41.** Phase composition on Terminal-Bench-2 (part 2 of 2): Gemini, Grok and Qwen families.

**Figure 42.** Genuine tool-error rate per cycle on Terminal-Bench-2 (part 1 of 2): Anthropic Claude …

**Figure 43.** Genuine tool-error rate per cycle on Terminal-Bench-2 (part 2 of 2): Gemini, Grok and …

**Figure 44.** Edit/test ratio per cycle on Terminal-Bench-2 (part 1 of 2): Anthropic Claude and OpenAI …

**Figure 45.** Edit/test ratio per cycle on Terminal-Bench-2 (part 2 of 2): Gemini, Grok and Qwen …

**Figure 46.** Tool-call distribution per model on Terminal-Bench-2 (part 1 of 2): Anthropic Claude and …

**Figure 47.** Tool-call distribution per model on Terminal-Bench-2 (part 2 of 2): Gemini, Grok and …

**Figure 48.** SWE-Pro Pass@1 under Base and Sanitized containers. The shaded overlay is the …

**Figure 49.** SWE-Bench-Pro Pass@1 for Qwen models under Base, Git Instructions, and Sanitized …

**Figure 50.** A corrupted tool name causes the model to abandon an intended read (gpt-oss-120b, sympy-13974, SWE-Bench-Verified). The model sets out to read add.py around _eval_power. A stray <|channel|> is folded into the file_read recipient, producing file_read<|channel|>commentary. The harness rejects the call with invalid tool name pattern. Because the injected token is invisible to the model, it concludes that it …

**Figure 51.** Token-level view of the corrupt-name example in …
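The captions for Figures 9 to 11 describe solution distance D as behaving like recall over patch features, taken against the nearest reference mode. The sketch below is one reading of those captions, assuming D = 1 − max over reference modes of (matched reference features / total reference features). It is not the paper's Eq. 5, which is not reproduced on this page, and the set-of-strings feature encoding is hypothetical.

```python
def solution_distance(live_features, reference_modes):
    """Distance from the live repository state to the nearest reference mode.

    Assumed form: D = 1 - max_r recall(live, r), where recall is the share
    of reference features in r that the live state reproduces. Extra
    features in the live state are not penalized under this reading.
    """
    live = set(live_features)
    best = 0.0
    for ref in reference_modes:
        ref = set(ref)
        if ref:  # skip empty reference patches
            best = max(best, len(live & ref) / len(ref))
    return 1.0 - best
```

Under this reading, a live state that matches one feature of a 12-feature reference gives D = 1 − 1/12 ≈ 0.917, which agrees with the value quoted in the Figure 11 caption.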

Original abstract

AI agent performance is not just a modeling problem, it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the 'intent-execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called 'Simple Strands Agent' (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond the pass@1 numbers which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase-transitions reveal how individual models allocate effort across different stages of autonomous problem solving.
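To make the intent-execution gap concrete, here is a minimal sketch of how a harness might handle a tool name polluted by a leaked control token, as in the `file_read<|channel|>commentary` case from the Figure 50 caption. This is an illustrative assumption, not SSA's actual mechanism: the `KNOWN_TOOLS` registry, the `resolve_tool_name` function, the regular expression, and the feedback wording are all hypothetical.

```python
import re

# Hypothetical tool registry; file_read and apply_patch appear in the
# figure captions, shell is added here purely for illustration.
KNOWN_TOOLS = {"file_read", "shell", "apply_patch"}

# Matches control-token debris such as <|channel|>.
SPECIAL_TOKEN = re.compile(r"<\|[^|>]*\|>")


def resolve_tool_name(raw_name):
    """Return (tool_name, feedback); tool_name is None when unrecoverable."""
    if raw_name in KNOWN_TOOLS:
        return raw_name, None
    # Keep only the text before the first leaked control token.
    head = SPECIAL_TOKEN.split(raw_name, maxsplit=1)[0].strip()
    if head in KNOWN_TOOLS:
        return head, f"Recovered tool '{head}' from malformed name {raw_name!r}."
    # Name the offending string and list valid tools instead of a generic error.
    return None, (
        f"Unknown tool {raw_name!r}. Available tools: "
        + ", ".join(sorted(KNOWN_TOOLS))
    )
```

The design point is that the feedback echoes the exact string the harness received, so that debris the model cannot see in its own output becomes visible to it on the next turn.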

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows model differences in edit frequency and phase transitions via 138k trajectories in one harness's code state-space, but that single-harness setup makes it unclear if the patterns are model-intrinsic or SSA artifacts.


The main point is that they ran several frontier models through their Simple Strands Agent harness on SWE benchmarks, collected 138k trajectories, mapped them into a code state-space, and tracked metrics like edit frequency, testing activity, and phase transitions. This surfaces behavioral differences even when pass@1 scores are comparable across Claude, Gemini, GPT, Grok, and Qwen.

They reproduce or improve the reported pass@1 numbers on SWE-Pro, SWE-Verified, and Terminal-Bench-2. That part is straightforward and useful for confirming the harness works at all. The trajectory scale is large enough to make the observational claims worth looking at, and the state-space framing plus the specific metrics give a concrete way to inspect how models allocate effort during autonomous coding tasks.

The soft spot is the single-harness design. All trajectories come from the same SSA setup with its fixed state representation, loop structure, and prompting. Since pass@1 is roughly even, any systematic interaction between model family and that particular harness could generate the reported metric gaps without those gaps reflecting general problem-solving strategies. The abstract gives no cross-harness checks or variation in the state encoding, and it is light on sampling details or multiple-testing corrections for the metric differences.

This is for people who build and debug agent harnesses and want diagnostics finer than success rate. A reader working on agent evaluation would pick up usable metric ideas, but the evidence that the differences are model-level rather than harness-coupled is not yet solid. It deserves peer review if the authors add harness-variation controls and clearer methods; otherwise the central claim stays provisional.

Referee Report

1 major / 1 minor

Summary. The paper formalizes the 'intent-execution' gap between model intent and harness execution in AI agents. It introduces the Simple Strands Agent (SSA) harness, reports reproducing or improving pass@1 on SWE-Pro, SWE-Verified, and Terminal-Bench-2 across model families (Claude, Gemini, GPT, Grok, Qwen), and analyzes 138k trajectories represented in code state-spaces to identify model-level differences in finer-grained behaviors including edit frequency, testing activity, and phase transitions.

Significance. If the reported behavioral differences prove robust beyond the specific SSA harness, the work would offer a concrete methodology for moving past aggregate pass@1 metrics to understand how models allocate effort across problem-solving stages, directly supporting harness-model alignment research.

major comments (1)

[Trajectory analysis and experimental setup] The central claim that model-level differences in edit frequency, testing activity, and phase-transitions are observable via code state-spaces rests on trajectories generated exclusively by the single SSA harness (fixed state representation, loop structure, and prompting). Given the paper's own observation of roughly comparable pass@1 across families, systematic harness-model coupling could produce the divergences without reflecting intrinsic strategies; no cross-harness ablation or variation control is described to isolate model effects.

minor comments (1)

[Methods / Results] The abstract provides no detail on trajectory sampling procedure, precise definition of the code state-spaces, or application of multiple-testing correction to the reported metric differences.
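For reference, one standard multiple-testing correction that could be applied to a family of metric comparisons is the Holm-Bonferroni step-down procedure. The sketch below is a generic implementation; the p-values in the usage note are made up and do not come from the paper.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Holm's step-down procedure: compare the k-th smallest p-value
    (k = 0, 1, ...) against alpha / (m - k), stopping at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are retained as well
    return reject
```

With four hypothetical comparisons, `holm_bonferroni([0.01, 0.04, 0.03, 0.005])` rejects the first and last: 0.005 ≤ 0.05/4 and 0.01 ≤ 0.05/3, while 0.03 exceeds 0.05/2 and stops the procedure.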

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on the manuscript. We address the single major comment below.


Referee: [Trajectory analysis and experimental setup] The central claim that model-level differences in edit frequency, testing activity, and phase-transitions are observable via code state-spaces rests on trajectories generated exclusively by the single SSA harness (fixed state representation, loop structure, and prompting). Given the paper's own observation of roughly comparable pass@1 across families, systematic harness-model coupling could produce the divergences without reflecting intrinsic strategies; no cross-harness ablation or variation control is described to isolate model effects.

Authors: We agree this is a valid concern. The 138k trajectories were generated exclusively with the SSA harness, and no cross-harness ablations or controlled variations in state representation, loop structure, or prompting were performed. SSA was designed as a minimal, general-purpose harness to reduce the intent-execution gap and enable consistent comparison across model families, with the similar pass@1 scores providing some evidence of harness parity. Nevertheless, the possibility of harness-model interactions cannot be ruled out from the current data alone. We will revise the manuscript to explicitly acknowledge this limitation in the experimental setup and trajectory analysis sections and to identify cross-harness validation as an important direction for future work.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical trajectory analysis is self-contained


The paper reports new experimental runs of 138k trajectories on public benchmarks using the SSA harness, followed by direct observation of metrics such as edit frequency and phase transitions. No equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on fresh data collection and comparison across models rather than any reduction of outputs to inputs by construction, satisfying the default expectation of non-circularity for empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical paper; no mathematical axioms or invented entities. The central claims rest on the design choices of the SSA harness and the definition of the trajectory state-space, which function as domain assumptions rather than derived quantities.



Reference graph

Works this paper leans on

92 extracted references · 3 canonical work pages

[1]

The Amazon Nova family of foundation models

Amazon AGI. The Amazon Nova family of foundation models. https://aws.amazon.com/ nova/, 2024

2024
[2]

A general path-based representation for predicting program properties, 2018

Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. A general path-based representation for predicting program properties, 2018. URLhttps://arxiv.org/abs/1803.09544. 14 Dissecting model behavior through agent trajectories

Pith/arXiv arXiv 2018
[3]

The Claude 4 model family: System cards and capability notes

Anthropic. The Claude 4 model family: System cards and capability notes. https://www. anthropic.com/claude, 2025

2025
[4]

Strands Agents: A model-driven SDK for building AI agents.https://strandsagents

AWS. Strands Agents: A model-driven SDK for building AI agents.https://strandsagents. com, 2025. Open-source SDK

2025
[5]

Barr, Yuriy Brun, Premkumar Devanbu, Mark Harman, and Federica Sarro

Earl T. Barr, Yuriy Brun, Premkumar Devanbu, Mark Harman, and Federica Sarro. The plastic surgery hypothesis. InProceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, page 306–317, New York, NY , USA, 2014. Association for Computing Machinery. ISBN 9781450330565. doi: 10.1145/2635868.2635898. URLhttps...

work page doi:10.1145/2635868.2635898 2014
[6]

LiteLLM: A unified gateway and proxy for LLM APIs

BerriAI. LiteLLM: A unified gateway and proxy for LLM APIs. https://github.com/ BerriAI/litellm, 2023. Open-source library

2023
[7]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005. 14165

2020
[8]

Evaluating large language models trained on code, 2021

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, et al. Evaluating large language models trained on code, 2021. URL https: //arxiv.org/abs/2107.03374

Pith/arXiv arXiv 2021
[9]

Gemini 3: Pro and Flash

Google DeepMind. Gemini 3: Pro and Flash. https://deepmind.google/technologies/ gemini/, 2025

2025
[10]

Deepseek-v3 technical report, 2025

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, et al. Deepseek-v3 technical report, 2025. URLhttps://arxiv.org/abs/2412.19437

Pith/arXiv arXiv 2025
[11]

SWE- Bench Pro: Can AI agents solve long-horizon software engineering tasks?, 2025

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, et al. SWE- Bench Pro: Can AI agents solve long-horizon software engineering tasks?, 2025. URL https: //arxiv.org/abs/2509.16941

Pith/arXiv arXiv 2025
[12]

Agentic RL: Token-in, token-out done right, 2026

Quentin Gallouédec and Kashif Rasul. Agentic RL: Token-in, token-out done right, 2026. Accessed 2026-06-07

2026
[13]

GraphCodeBERT: Pre-training code representations with data flow, 2021

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, et al. GraphCodeBERT: Pre-training code representations with data flow, 2021. URLhttps://arxiv.org/abs/2009. 08366

2021
[14]

Deepseek- r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638,

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, et al. Deepseek- r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638,
[15]

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , volume =

ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx.doi.org/10. 1038/s41586-025-09422-z

work page doi:10.1038/s41586-025-09422-z
[16]

Qwen2.5-coder technical report, 2024

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, et al. Qwen2.5-coder technical report, 2024. URLhttps://arxiv.org/abs/2409.12186

Pith/arXiv arXiv 2024
[17]

LiveCodeBench: Holistic and contami- nation free evaluation of large language models for code, 2024

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contami- nation free evaluation of large language models for code, 2024. URL https://arxiv.org/ abs/2403.07974

Pith/arXiv arXiv 2024
[18]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URLhttps://arxiv.org/abs/2310.06770

Pith/arXiv arXiv 2024
[19]

CODESTRUCT: Code agents over structured action spaces, 2026

Myeongsoo Kim, Joe Hsu, Dingmin Wang, Shweta Garg, Varun Kumar, and Murali Krishna Ramanathan. CODESTRUCT: Code agents over structured action spaces, 2026. URL https: //arxiv.org/abs/2604.05407

Pith/arXiv arXiv 2026
[20]

Large language models are zero-shot reasoners, 2023

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023. URL https://arxiv.org/abs/2205.11916. 15 Dissecting model behavior through agent trajectories

Pith/arXiv arXiv 2023
[21]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention, 2023. URL https://arxiv.org/abs/2309. 06180

2023
[22]

A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each

Claire Le Goues, Michael Dewey-V ogt, Stephanie Forrest, and Westley Weimer. A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each. InProceedings of the 34th International Conference on Software Engineering, ICSE ’12, page 3–13. IEEE Press,
[23]

Automated program repair

Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. Automated program repair. Commun. ACM, 62(12):56–65, November 2019. ISSN 0001-0782. doi: 10.1145/3318162. URL https://doi.org/10.1145/3318162

work page doi:10.1145/3318162 2019
[24]

Repobench: Benchmarking repository-level code auto-completion systems, 2023

Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems, 2023. URLhttps://arxiv.org/abs/2306.03091

Pith/arXiv arXiv 2023
[25]

Agentbench: Evaluating llms as agents, 2025

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, et al. Agentbench: Evaluating llms as agents, 2025. URLhttps://arxiv.org/abs/2308.03688

Pith/arXiv arXiv 2025
[26]

An analysis of the search spaces for generate and validate patch generation systems, 2016

Fan Long and Martin Rinard. An analysis of the search spaces for generate and validate patch generation systems, 2016. URLhttps://arxiv.org/abs/1602.05643

Pith/arXiv arXiv 2016
[27]

GPT-4.1 prompting guide

Noah MacCallum and Julian Lee. GPT-4.1 prompting guide. https://cookbook.openai. com/examples/gpt4-1_prompting_guide/, 2025. OpenAI Cookbook

2025
[28]

Data contamination: From memorization to exploitation, 2022

Inbal Magar and Roy Schwartz. Data contamination: From memorization to exploitation, 2022. URLhttps://arxiv.org/abs/2203.08242

arXiv 2022
[29]

Merrill, Alexander G

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026. URLhttps://arxiv.org/abs/2601.11868

Pith/arXiv arXiv 2026
[30]

MiniMax-M2: A foundation model with extended context and tool use

MiniMax AI. MiniMax-M2: A foundation model with extended context and tool use. https: //huggingface.co/MiniMaxAI, 2025

2025
[31]

Alberto Moraglio, Krzysztof Krawiec, and Colin G. Johnson. Geometric semantic genetic programming. InParallel Problem Solving from Nature - PPSN XII, pages 21–31, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-32937-1

2012
[32]

The GPT-5 model family and the gpt-oss open-weights release

OpenAI. The GPT-5 model family and the gpt-oss open-weights release. https://openai.com/, 2025

2025
[33]

openai-harmony: Response format for the gpt-oss models

OpenAI. openai-harmony: Response format for the gpt-oss models. https://github.com/openai/harmony, 2025. Accessed 2026-06-07

2025
[34]

ToolLLM: Facilitating large language models to master 16000+ real-world apis, 2023

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, et al. ToolLLM: Facilitating large language models to master 16000+ real-world apis, 2023. URL https://arxiv.org/abs/2307.16789

Pith/arXiv arXiv 2023
[35]

Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents, 2025

Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, et al. Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents, 2025. URL https://arxiv.org/abs/2504.08703

2025
[36]

Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark, 2023

Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark, 2023. URL https://arxiv.org/abs/2310.18018

arXiv 2023
[37]

Toolformer: Language models can teach themselves to use tools, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023. URL https://arxiv.org/abs/2302.04761

Pith/arXiv arXiv 2023
[38]

Building effective agents

Erik Schluntz and Barry Zhang. Building effective agents. https://www.anthropic.com/research/building-effective-agents, 2024. Anthropic engineering blog.

2024
[39]

Reflexion: Language agents with verbal reinforcement learning, 2023

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366

Pith/arXiv arXiv 2023
[40]

Gradient-based program repair: Fixing bugs in continuous program spaces, 2026

André Silva, Gustav Thorén, and Martin Monperrus. Gradient-based program repair: Fixing bugs in continuous program spaces, 2026. URL https://arxiv.org/abs/2505.17703

arXiv 2026
[41]

A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), 2024

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), 2024. URL https://arxiv.org/abs/2308.11432

Pith/arXiv arXiv 2024
[42]

Executable code actions elicit better LLM agents. arXiv preprint arXiv:2402.01030, 2024

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM agents. arXiv preprint arXiv:2402.01030, 2024. URL https://arxiv.org/abs/2402.01030

arXiv 2024
[43]

Openhands: An open platform for ai software developers as generalist agents, 2025

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, et al. Openhands: An open platform for ai software developers as generalist agents, 2025. URL https://arxiv.org/abs/2407.16741

Pith/arXiv arXiv 2025
[44]

Emergent abilities of large language models, 2022

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, et al. Emergent abilities of large language models, 2022. URL https://arxiv.org/abs/2206.07682

Pith/arXiv arXiv 2022
[45]

Chain-of-thought prompting elicits reasoning in large language models, 2023

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

Pith/arXiv arXiv 2023
[46]

AutoGen: Enabling next-gen llm applications via multi-agent conversation, 2023

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, et al. AutoGen: Enabling next-gen llm applications via multi-agent conversation, 2023. URL https://arxiv.org/abs/2308.08155

Pith/arXiv arXiv 2023
[47]

Grok 4.20 reasoning model. https://x.ai/, 2025

xAI. Grok 4.20 reasoning model. https://x.ai/, 2025

2025
[48]

The rise and potential of large language model based agents: A survey, 2023

Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, et al. The rise and potential of large language model based agents: A survey, 2023. URL https://arxiv.org/abs/2309.07864

Pith/arXiv arXiv 2023
[49]

Agentless: Demystifying llm-based software engineering agents, 2024

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents, 2024. URL https://arxiv.org/abs/2407.01489

Pith/arXiv arXiv 2024
[50]

OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. URL https://arxiv.org/abs/2404.07972

Pith/arXiv arXiv 2024
[51]

Hydra – a framework for elegantly configuring complex applications

Omry Yadan. Hydra – a framework for elegantly configuring complex applications. https://github.com/facebookresearch/hydra, 2019. GitHub repository

2019
[52]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

Pith/arXiv arXiv 2025
[53]

SWE-agent: Agent-computer interfaces enable automated software engineering, 2024

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering, 2024. URL https://arxiv.org/abs/2405.15793

Pith/arXiv arXiv 2024
[54]

Swe-bench multimodal: Do ai systems generalize to visual software domains?, 2024

John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, and Ofir Press. Swe-bench multimodal: Do ai systems generalize to visual software domains?, 2024. URL https://arxiv.org/abs/2410.03859

arXiv 2024
[55]

React: Synergizing reasoning and acting in language models, 2023

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023. URL https://arxiv.org/abs/2210.03629.

Pith/arXiv arXiv 2023
[56]

Autocoderover: Autonomous program improvement, 2024

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement, 2024. URL https://arxiv.org/abs/2404.05427

arXiv 2024
[57]

Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/abs/2306.05685

Pith/arXiv arXiv 2023
[58]

Webarena: A realistic web environment for building autonomous agents, 2024

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/2307.13854.

2024
[59]

Forced tool-use output. The judge cannot reply with free-form text. Its only allowed output is a single call to the submit_classifications tool whose schema only accepts an enum. Outputs are validated client-side before acceptance. A.1.2 Classification rubric Table 4 lists every field the judge assigns to a call. R1–R5 carry the bulk of the behavioural signa...
[60]

fraction-of-fix-achieved

Textual overlap, not semantics. The recall fraction in Eq. 6 counts matching changed lines. A correct-but-textually-different fix (for example, a different identifier choice, a refactored expression, or a guard placed elsewhere) scores <1 against reference modes it does not textually match. The empirical subsets and the self-anchor for resolved endpoints mitigate thi...
[61]

Empirical subset ≠ solution space. The space is defined by the test oracle and $\tilde{S}_i$ approximates it with the patches we happened to observe (using 21 models run 5 times). A unique correct solution found by nobody else, a real possibility on novel instances, can be far from the observed empirical modes, which is why the self-anchor is reserved for resolved ...
[62]

$\tilde{S}_i$ is dense on easy instances (for example, in SWE-Bench-Verified, 80+ reference modes) and sparse on hard ones

Reference-set sparsity scales with difficulty. $\tilde{S}_i$ is dense on easy instances (for example, in SWE-Bench-Verified, 80+ reference modes) and sparse on hard ones. Instances that no model in our sweep solved have only the gold patch as a reference (or no reference if gold is also missing). D on these instances reads against a thin reference set and should be in...
[63]

Judge R#

Replay fidelity. d(t) is reconstructed from the edit-tool calls we parse (Table 5). Therefore, exotic shell rewrites (e.g. Python scripts that open a source file in write mode) are not fully parsed. Self-anchor fixes the resolved endpoint, but it cannot recover the exact intermediate state of an unparsed edit, so a small number of trajectories’ mid-run sh...

arXiv 2025
[68]

User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a python code repository in the directory {{project_path}} (not in /tmp/inputs)

Think about edgecases and make sure your fix handles them as well Your thinking should be thorough and so it’s fine if it’s very long. User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a python code repository in the directory {{project_path}} (not in /tmp/inputs). Consider the following PR description: <pr_description> {{git_issue}}...
[73]

User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a code repository in the directory {{project_path}} (not in /tmp/inputs)

Think about edgecases and make sure your fix handles them as well Your thinking should be thorough and so it’s fine if it’s very long. User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a code repository in the directory {{project_path}} (not in /tmp/inputs). Consider the following PR description: <pr_description> {{git_issue}} </pr_d...
[75]

Create a script to reproduce the error and execute it using the BashTool, to confirm the error
[78]

If any test fails, diagnose the failure and fix your implementation

Run the existing test suite for the affected module. If any test fails, diagnose the failure and fix your implementation
[79]

IMPORTANT rules: - You MUST run the tests frequently, and verify correctness of changes by running relevant tests

Think about edgecases and make sure your fix handles them as well. IMPORTANT rules: - You MUST run the tests frequently, and verify correctness of changes by running relevant tests. If tests fail, analyze failures and revise your patch. - Failing to test sufficiently rigorously is the NUMBER ONE failure mode. - There are hidden tests beyond what is visibl...
[87]

- If any test fails, diagnose the failure, revise your fix, and rerun until they all pass

Find and run the repository’s own existing tests for the files and functions you modified (e.g., ‘pytest path/to/test_file.py‘, the project’s tests, etc.). - If any test fails, diagnose the failure, revise your fix, and rerun until they all pass. - Do not finish until the relevant repo tests pass. Your thinking should be thorough and so it’s fine if it’s ...
[88]

Before exploring anything, use the ‘think‘ tool to write up: - the task restated in your own words - 3-5 hypotheses for the root cause, ranked by likelihood
[89]

Explore the repo to familiarize yourself with its structure
[91]

Use the ‘think‘ tool to list 2-3 candidate fixes in 1-2 lines each, then pick the simplest one
[93]

Rerun your reproduce script and confirm that the error is fixed
[94]

Use the ‘think‘ tool to enumerate 3-5 edge cases for the changed code, then exercise each via the reproduction script or shell
[95]

- If any test fails, diagnose the failure, revise your fix, and rerun until they all pass

Find and run the repository’s own existing tests for the files and functions you modified (e.g., ‘pytest path/to/test_file.py‘, the project’s tests, etc.). - If any test fails, diagnose the failure, revise your fix, and rerun until they all pass. - Do not finish until the relevant repo tests pass. Your thinking should be thorough and so it’s fine if it’s ...
[100]

ideally more than 100 times

Think about edgecases and make sure your fix handles them as well Your thinking should be thorough and so it’s fine if it’s very long. In this environment, you can run ‘<apply_patch_command>‘ to execute a diff/patch against a file, where <apply_patch_command> is a specially formatted apply patch command representing the diff you wish to execute. A valid <...
[102]

Create a script to reproduce the error and execute it with ‘python <filename.py>‘ using the BashTool, to confirm the error
[103]

Edit the sourcecode of the repo to resolve the issue
[104]

Rerun your reproduce script and confirm that the error is fixed!
[105]

User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a python code repository in the directory {{project_path}} (not in /tmp/inputs)

Think about edgecases and make sure your fix handles them as well Your thinking should be thorough and so it’s fine if it’s very long. User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a python code repository in the directory {{project_path}} (not in /tmp/inputs). Consider the following PR description: <pr_description> {{git_issue}}...


[1] [1]

The Amazon Nova family of foundation models

Amazon AGI. The Amazon Nova family of foundation models. https://aws.amazon.com/nova/, 2024

2024

[2] [2]

A general path-based representation for predicting program properties, 2018

Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. A general path-based representation for predicting program properties, 2018. URL https://arxiv.org/abs/1803.09544.

Pith/arXiv arXiv 2018

[3] [3]

The Claude 4 model family: System cards and capability notes

Anthropic. The Claude 4 model family: System cards and capability notes. https://www.anthropic.com/claude, 2025

2025

[4] [4]

Strands Agents: A model-driven SDK for building AI agents

AWS. Strands Agents: A model-driven SDK for building AI agents. https://strandsagents.com, 2025. Open-source SDK

2025

[5] [5]

The plastic surgery hypothesis

Earl T. Barr, Yuriy Brun, Premkumar Devanbu, Mark Harman, and Federica Sarro. The plastic surgery hypothesis. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, page 306–317, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450330565. doi: 10.1145/2635868.2635898. URL https...

work page doi:10.1145/2635868.2635898 2014

[6] [6]

LiteLLM: A unified gateway and proxy for LLM APIs

BerriAI. LiteLLM: A unified gateway and proxy for LLM APIs. https://github.com/BerriAI/litellm, 2023. Open-source library

2023

[7] [7]

Language models are few-shot learners, 2020

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165

2020

[8] [8]

Evaluating large language models trained on code, 2021

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, et al. Evaluating large language models trained on code, 2021. URL https://arxiv.org/abs/2107.03374

Pith/arXiv arXiv 2021

[9] [9]

Gemini 3: Pro and Flash

Google DeepMind. Gemini 3: Pro and Flash. https://deepmind.google/technologies/gemini/, 2025

2025

[10] [10]

Deepseek-v3 technical report, 2025

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, et al. Deepseek-v3 technical report, 2025. URL https://arxiv.org/abs/2412.19437

Pith/arXiv arXiv 2025

[11] [11]

SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks?, 2025

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, et al. SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks?, 2025. URL https://arxiv.org/abs/2509.16941

Pith/arXiv arXiv 2025

[12] [12]

Agentic RL: Token-in, token-out done right, 2026

Quentin Gallouédec and Kashif Rasul. Agentic RL: Token-in, token-out done right, 2026. Accessed 2026-06-07

2026

[13] [13]

GraphCodeBERT: Pre-training code representations with data flow, 2021

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, et al. GraphCodeBERT: Pre-training code representations with data flow, 2021. URL https://arxiv.org/abs/2009.08366

2021

[14] [14]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638,

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638,

[15] [15]

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx.doi.org/10.1038/s41586-025-09422-z

work page doi:10.1038/s41586-025-09422-z

[16] [16]

Qwen2.5-coder technical report, 2024

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, et al. Qwen2.5-coder technical report, 2024. URL https://arxiv.org/abs/2409.12186

Pith/arXiv arXiv 2024

[17] [17]

LiveCodeBench: Holistic and contamination free evaluation of large language models for code, 2024

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code, 2024. URL https://arxiv.org/abs/2403.07974

Pith/arXiv arXiv 2024

[18] [18]

Swe-bench: Can language models resolve real-world github issues?, 2024

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URL https://arxiv.org/abs/2310.06770

Pith/arXiv arXiv 2024

[19] [19]

CODESTRUCT: Code agents over structured action spaces, 2026

Myeongsoo Kim, Joe Hsu, Dingmin Wang, Shweta Garg, Varun Kumar, and Murali Krishna Ramanathan. CODESTRUCT: Code agents over structured action spaces, 2026. URL https://arxiv.org/abs/2604.05407

Pith/arXiv arXiv 2026

[20] [20]

Large language models are zero-shot reasoners, 2023

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023. URL https://arxiv.org/abs/2205.11916.

Pith/arXiv arXiv 2023

[21] [21]

Efficient memory management for large language model serving with PagedAttention, 2023

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention, 2023. URL https://arxiv.org/abs/2309.06180

2023

[22] [22]

A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each

Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each. In Proceedings of the 34th International Conference on Software Engineering, ICSE ’12, page 3–13. IEEE Press,

[23] [23]

Automated program repair

Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. Automated program repair. Commun. ACM, 62(12):56–65, November 2019. ISSN 0001-0782. doi: 10.1145/3318162. URL https://doi.org/10.1145/3318162

work page doi:10.1145/3318162 2019

[24] [24]

Repobench: Benchmarking repository-level code auto-completion systems, 2023

Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems, 2023. URLhttps://arxiv.org/abs/2306.03091

Pith/arXiv arXiv 2023

[25] [25]

Agentbench: Evaluating llms as agents, 2025

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, et al. Agentbench: Evaluating llms as agents, 2025. URLhttps://arxiv.org/abs/2308.03688

Pith/arXiv arXiv 2025

[26] [26]

An analysis of the search spaces for generate and validate patch generation systems, 2016

Fan Long and Martin Rinard. An analysis of the search spaces for generate and validate patch generation systems, 2016. URLhttps://arxiv.org/abs/1602.05643

Pith/arXiv arXiv 2016

[27] [27]

GPT-4.1 prompting guide

Noah MacCallum and Julian Lee. GPT-4.1 prompting guide. https://cookbook.openai. com/examples/gpt4-1_prompting_guide/, 2025. OpenAI Cookbook

2025

[28] [28]

Data contamination: From memorization to exploitation, 2022

Inbal Magar and Roy Schwartz. Data contamination: From memorization to exploitation, 2022. URLhttps://arxiv.org/abs/2203.08242

arXiv 2022

[29] [29]

Merrill, Alexander G

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026. URLhttps://arxiv.org/abs/2601.11868

Pith/arXiv arXiv 2026

[30] [30]

MiniMax-M2: A foundation model with extended context and tool use

MiniMax AI. MiniMax-M2: A foundation model with extended context and tool use. https: //huggingface.co/MiniMaxAI, 2025

2025

[31] [31]

Alberto Moraglio, Krzysztof Krawiec, and Colin G. Johnson. Geometric semantic genetic programming. InParallel Problem Solving from Nature - PPSN XII, pages 21–31, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-32937-1

2012

[32] [32]

The GPT-5 model family and the gpt-oss open-weights release

OpenAI. The GPT-5 model family and the gpt-oss open-weights release. https://openai. com/, 2025

2025

[33] [33]

openai-harmony: Response format for the gpt-oss models

OpenAI. openai-harmony: Response format for the gpt-oss models. https://github.com/ openai/harmony, 2025. Accessed 2026-06-07

2025

[34] [34]

ToolLLM: Facilitating large language models to master 16000+ real-world apis, 2023

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, et al. ToolLLM: Facilitating large language models to master 16000+ real-world apis, 2023. URLhttps://arxiv.org/abs/2307.16789

Pith/arXiv arXiv 2023

[35] [35]

Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents, 2025

Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, et al. Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents, 2025. URL https://arxiv.org/abs/2504. 08703

2025

[36] [36]

Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark, 2023

Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark, 2023. URLhttps://arxiv.org/abs/2310.18018

arXiv 2023

[37] [37]

Toolformer: Language models can teach themselves to use tools, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettle- moyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023. URLhttps://arxiv.org/abs/2302.04761

Pith/arXiv arXiv 2023

[38] [38]

Building effective agents

Erik Schluntz and Barry Zhang. Building effective agents. https://www.anthropic.com/ research/building-effective-agents, 2024. Anthropic engineering blog. 16 Dissecting model behavior through agent trajectories

2024

[39] [39]

Reflexion: Language agents with verbal reinforcement learning, 2023

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366

Pith/arXiv arXiv 2023

[40] [40]

Gradient-based program repair: Fixing bugs in continuous program spaces, 2026

André Silva, Gustav Thorén, and Martin Monperrus. Gradient-based program repair: Fixing bugs in continuous program spaces, 2026. URLhttps://arxiv.org/abs/2505.17703

arXiv 2026

[41] [41]

A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6), 2024

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6), 2024. URLhttps://arxiv.org/abs/2308.11432

Pith/arXiv arXiv 2024

[42] [42]

Executable code actions elicit better LLM agents.arXiv preprint arXiv:2402.01030, 2024

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM agents.arXiv preprint arXiv:2402.01030, 2024. URL https://arxiv.org/abs/2402.01030

arXiv 2024

[43] [43]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, et al

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, et al. Openhands: An open platform for ai software developers as generalist agents, 2025. URLhttps://arxiv.org/abs/2407.16741

Pith/arXiv arXiv 2025

[44] [44]

Emergent abilities of large language models, 2022

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, et al. Emergent abilities of large language models, 2022. URL https://arxiv.org/abs/2206.07682

Pith/arXiv arXiv 2022

[45] [45]

Chain-of-thought prompting elicits reasoning in large language models, 2023

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URLhttps://arxiv.org/abs/2201.11903

Pith/arXiv arXiv 2023

[46] [46]

AutoGen: Enabling next-gen llm applications via multi-agent conversation, 2023

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, et al. AutoGen: Enabling next-gen llm applications via multi-agent conversation, 2023. URL https://arxiv. org/abs/2308.08155

Pith/arXiv arXiv 2023

[47] [47]

Grok 4.20 reasoning model.https://x.ai/, 2025

xAI. Grok 4.20 reasoning model.https://x.ai/, 2025

2025

[48] [48]

The rise and potential of large language model based agents: A survey, 2023

Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, et al. The rise and potential of large language model based agents: A survey, 2023. URL https://arxiv.org/ abs/2309.07864

Pith/arXiv arXiv 2023

[49] [49]

Agentless: Demystifying llm-based software engineering agents, 2024

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents, 2024. URLhttps://arxiv.org/abs/2407.01489

Pith/arXiv arXiv 2024

[50] [50]

OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. URLhttps://arxiv.org/abs/2404.07972

Pith/arXiv arXiv 2024

[51] [51]

Hydra – a framework for elegantly configuring complex applications

Omry Yadan. Hydra – a framework for elegantly configuring complex applications. https: //github.com/facebookresearch/hydra, 2019. GitHub repository

2019

[52] [52]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. Qwen3 technical report, 2025. URLhttps://arxiv.org/abs/2505.09388

Pith/arXiv arXiv 2025

[53] [53]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering, 2024. URLhttps://arxiv.org/abs/2405.15793

Pith/arXiv arXiv 2024

[54] [54]

Jimenez, Alex L

John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, and Ofir Press. Swe-bench multimodal: Do ai systems generalize to visual software domains?, 2024. URLhttps://arxiv.org/abs/2410.03859

arXiv 2024

[55] [55]

React: Synergizing reasoning and acting in language models, 2023

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023. URL https: //arxiv.org/abs/2210.03629. 17 Dissecting model behavior through agent trajectories

Pith/arXiv arXiv 2023

[56] [56]

Autocoderover: Au- tonomous program improvement, 2024

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Au- tonomous program improvement, 2024. URLhttps://arxiv.org/abs/2404.05427

arXiv 2024

[57] [57]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/ abs/2306.05685

Pith/arXiv arXiv 2023

[58] [58]

Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URLhttps://arxiv.org/abs/2307. 13854. 18 Dissecting model behavior through agent trajectories Appendix Contents The appendi...

2024

[59] [59]

no match

Forced tool-use output.The judge cannot reply with free-form text. Its only allowed output is a single call to thesubmit_classifications tool whose schema only accepts anenum. Outputs are validated client-side before acceptance. A.1.2 Classification rubric Table 4 lists every field the judge assigns to a call. R1–R5 carry the bulk of the behavioural signa...

[60] [60]

fraction-of-fix-achieved

Textual overlap, not semantics.The recall fraction in Eq. 6 counts matching changed lines. A correct-but-textually-different fix, i.e., a different identifier choice, a refactored expression, a guard placed elsewhere scores <1 against reference modes it does not textually match. The empirical subsets and the self-anchor for resolved endpoints mitigate thi...

[61] [61]

Empirical subset ≠ solution space. The space is defined by the test oracle and S̃_i approximates it with the patches we happened to observe (using 21 models run 5 times). A unique correct solution found by nobody else, a real possibility on novel instances, can be far from the observed empirical modes, which is why the self-anchor is reserved for resolved ...

[62] [62]

S̃_i is dense on easy instances (for example, in SWE-Bench-Verified, 80+ reference modes) and sparse on hard ones

Reference-set sparsity scales with difficulty. S̃_i is dense on easy instances (for example, in SWE-Bench-Verified, 80+ reference modes) and sparse on hard ones. Instances that no model in our sweep solved have only the gold patch as a reference (or no reference if gold is also missing). D on these instances reads against a thin reference set and should be in...

[63] [63]

Judge R#

Replay fidelity. d(t) is reconstructed from the edit-tool calls we parse (Table 5). Edits made through exotic shell rewrites (e.g., Python scripts that open a source file in write mode) are therefore not fully parsed. Self-anchor fixes the resolved endpoint, but it cannot recover the exact intermediate state of an unparsed edit, so a small number of trajectories’ mid-run sh...
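The replay idea can be illustrated with a minimal sketch that rebuilds per-step file states from parsed edit calls. The `EditCall` shape and the string-replace semantics are assumptions for the example; the tool formats the paper actually parses are those listed in Table 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditCall:
    # Hypothetical shape of one parsed edit-tool call.
    step: int   # position in the trajectory
    path: str   # file being edited
    old: str    # text to replace ("" means write the whole file)
    new: str    # replacement text


def replay(files: dict[str, str], calls: list[EditCall]) -> list[tuple[int, dict[str, str]]]:
    """Apply edits in step order and snapshot the code state after each one."""
    state = dict(files)
    snapshots = []
    for call in sorted(calls, key=lambda c: c.step):
        content = state.get(call.path, "")
        if not call.old:
            state[call.path] = call.new
        elif call.old in content:
            state[call.path] = content.replace(call.old, call.new, 1)
        # Otherwise the edit cannot be applied and the state is left unchanged,
        # which mirrors the fidelity gap for edits the parser cannot follow.
        snapshots.append((call.step, dict(state)))
    return snapshots
```

A distance such as d(t) would then be computed on each snapshot; an edit performed outside the parsed tools never appears in `calls`, so the snapshots between it and the endpoint are only approximate.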

arXiv 2025

[64] [68]

User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a python code repository in the directory {{project_path}} (not in /tmp/inputs)

Think about edgecases and make sure your fix handles them as well Your thinking should be thorough and so it’s fine if it’s very long. User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a python code repository in the directory {{project_path}} (not in /tmp/inputs). Consider the following PR description: <pr_description> {{git_issue}}...

[65] [73]

User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a code repository in the directory {{project_path}} (not in /tmp/inputs)

Think about edgecases and make sure your fix handles them as well Your thinking should be thorough and so it’s fine if it’s very long. User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a code repository in the directory {{project_path}} (not in /tmp/inputs). Consider the following PR description: <pr_description> {{git_issue}} </pr_d...

[66] [75]

Create a script to reproduce the error and execute it using the BashTool, to confirm the error

[67] [78]

If any test fails, diagnose the failure and fix your implementation

Run the existing test suite for the affected module. If any test fails, diagnose the failure and fix your implementation

[68] [79]

IMPORTANT rules: - You MUST run the tests frequently, and verify correctness of changes by running relevant tests

Think about edgecases and make sure your fix handles them as well. IMPORTANT rules: - You MUST run the tests frequently, and verify correctness of changes by running relevant tests. If tests fail, analyze failures and revise your patch. - Failing to test sufficiently rigorously is the NUMBER ONE failure mode. - There are hidden tests beyond what is visibl...

[69] [87]

- If any test fails, diagnose the failure, revise your fix, and rerun until they all pass

Find and run the repository’s own existing tests for the files and functions you modified (e.g., ‘pytest path/to/test_file.py‘, the project’s tests, etc.). - If any test fails, diagnose the failure, revise your fix, and rerun until they all pass. - Do not finish until the relevant repo tests pass. Your thinking should be thorough and so it’s fine if it’s ...

[70] [88]

Before exploring anything, use the ‘think‘ tool to write up: - the task restated in your own words - 3-5 hypotheses for the root cause, ranked by likelihood

[71] [89]

Explore the repo to familiarize yourself with its structure

[72] [91]

Use the ‘think‘ tool to list 2-3 candidate fixes in 1-2 lines each, then pick the simplest one

[73] [93]

Rerun your reproduce script and confirm that the error is fixed

[74] [94]

Use the ‘think‘ tool to enumerate 3-5 edge cases for the changed code, then exercise each via the reproduction script or shell

[75] [95]

- If any test fails, diagnose the failure, revise your fix, and rerun until they all pass

Find and run the repository’s own existing tests for the files and functions you modified (e.g., ‘pytest path/to/test_file.py‘, the project’s tests, etc.). - If any test fails, diagnose the failure, revise your fix, and rerun until they all pass. - Do not finish until the relevant repo tests pass. Your thinking should be thorough and so it’s fine if it’s ...

[76] [100]

ideally more than 100 times

Think about edgecases and make sure your fix handles them as well Your thinking should be thorough and so it’s fine if it’s very long. In this environment, you can run ‘<apply_patch_command>‘ to execute a diff/patch against a file, where <apply_patch_command> is a specially formatted apply patch command representing the diff you wish to execute. A valid <...

[77] [102]

Create a script to reproduce the error and execute it with ‘python <filename.py>‘ using the BashTool, to confirm the error

[78] [103]

Edit the sourcecode of the repo to resolve the issue

[79] [104]

Rerun your reproduce script and confirm that the error is fixed!

[80] [105]

User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a python code repository in the directory {{project_path}} (not in /tmp/inputs)

Think about edgecases and make sure your fix handles them as well Your thinking should be thorough and so it’s fine if it’s very long. User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a python code repository in the directory {{project_path}} (not in /tmp/inputs). Consider the following PR description: <pr_description> {{git_issue}}...