
arxiv: 2606.17454 · v2 · pith:OAVPFXTVnew · submitted 2026-06-16 · 💻 cs.AI · cs.LG

Dissecting model behavior through agent trajectories

Pith reviewed 2026-06-27 01:25 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords agent trajectories · code state-spaces · intent-execution gap · model behavior analysis · autonomous problem solving · SSA harness · agent benchmarks · phase transitions

The pith

Representing agent trajectories in code state-spaces reveals model-level differences in problem-solving behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that AI agent performance depends on closing the intent-execution gap between model intent and harness execution. It introduces a simple customizable harness called SSA that reproduces or improves pass@1 results on benchmarks like SWE-Pro and Terminal-Bench-2 across model families. Analysis of 138k trajectories mapped into code state-spaces shows models differ in how they allocate effort, measured by edit frequency, testing activity, and phase transitions. These finer-grained patterns go beyond aggregate success rates to expose distinct autonomous problem-solving styles. The approach matters because it identifies model-specific behaviors that harness design can address.

Core claim

When agent trajectories are represented in code state-spaces, models show observable differences in problem-solving behavior on metrics such as edit frequency, testing activity, and phase transitions. These metrics indicate how individual models allocate effort across stages of autonomous problem solving, even when pass@1 scores are comparable.

What carries the argument

Code state-space representation of agent trajectories, which encodes sequences of code states to quantify metrics like edit frequency and testing activity.
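
As a rough illustration of how such metrics could be computed, the sketch below derives activity shares and a phase-transition count from a labeled trajectory. The `Step` record and both helper functions are hypothetical and are not taken from the paper; only the three activity names (source editing, scratch testing, suite testing) follow the Figure 5 caption, and the paper's actual state-space encoding may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One agent cycle with a coarse activity label (hypothetical schema)."""
    cycle: int
    activity: str  # "edit", "scratch_test", "suite_test", or "other"


def activity_shares(steps):
    """Fraction of cycles spent on each activity."""
    counts = Counter(step.activity for step in steps)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


def phase_transitions(steps):
    """Number of times the activity label changes between consecutive cycles."""
    ordered = sorted(steps, key=lambda step: step.cycle)
    return sum(1 for a, b in zip(ordered, ordered[1:]) if a.activity != b.activity)


# Toy trajectory: explore, try a scratch test, edit twice, run the suite.
trajectory = [
    Step(0, "other"),
    Step(1, "scratch_test"),
    Step(2, "edit"),
    Step(3, "edit"),
    Step(4, "suite_test"),
]
shares = activity_shares(trajectory)
transitions = phase_transitions(trajectory)
```

Two models with the same pass@1 could still differ on `shares["edit"]` or `transitions`, which is the kind of contrast the core claim refers to.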

Load-bearing premise

The chosen code state-space representation and the SSA harness do not systematically distort the observed differences, so that trajectories reflect model intent rather than harness artifacts.

What would settle it

Finding that different models produce statistically indistinguishable distributions of edit frequencies, testing activities, and phase-transitions when run in the same SSA harness on the same benchmarks.
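
One way to check for "statistically indistinguishable" distributions is a two-sample permutation test on a per-trajectory metric. The sketch below is an assumption about how such a check could be run, not a procedure described in the paper; the function name, the sample values, and the choice of the difference in means as the test statistic are all illustrative.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the absolute difference of means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Made-up per-trajectory edit frequencies for two hypothetical models.
model_a = [0.10, 0.12, 0.11, 0.13, 0.09, 0.10]
model_b = [0.40, 0.42, 0.39, 0.41, 0.43, 0.38]
p_diff = permutation_p_value(model_a, model_b)
p_same = permutation_p_value([0.2, 0.3, 0.4], [0.2, 0.3, 0.4])
```

A large p-value for every metric and every model pair, on the same harness and benchmarks, would count against the core claim; a small one supports a difference but does not by itself separate model effects from harness effects.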

Figures

Figures reproduced from arXiv: 2606.17454 by Anoop Deoras, Gaurav Gupta, Jun Huan, Vatshank Chaturvedi.

Figure 1: The intent-execution gap derails an agent. Each panel pairs the model’s intent (its own reasoning) with the execution, i.e., the payload the harness received and the feedback it provided in a closed loop. The model sets out to read a file, but due to parsing issues in the decoded streams, the harness receives an invalid tool name. The harness sends a generic but valid feedback message, ‘tool name not found’. In …
Figure 2: SSA architecture. Tasks and model adapters feed a cyclic SimpleStrandAgent loop. The loop streams model output, dispatches tools, executes them in the environment, and appends tool results to conversation state. Hooks validate bounds and record the loop; termination extracts the final environment state and diff patch. Model interface. Adapters map model-provider APIs into a common event interface: assistan…
Figure 3: Reasoning nudges in SSA: a quantitative target for Claude variants and a flexible directive for other model families. 3.3 Aligning harness with the model’s tool use preferences Across agents, while tools function in exactly the same way, models tend to exhibit distinct preferences in how they invoke them. For example, GPT models prefer to update code by using an apply_patch command to splice in text from a…
Figure 4: Agent trajectory in code state space. Evolution of the solution divergence D(t) as the agent traverses code states from the given initial state at t1 until it reaches the solution space Si at step t4. Each panel compares the reconstructed live diff with the closest element in Si. Green lines are reproduced patch features and gray lines are still missing. The trace follows a staircase D = 1.0 → D ≈ 0.67 → D = 0.5 → D = 0 a…
Figure 5: Composition of activities is a model signature. Share of three activities (source editing, scratch testing, suite testing) vs. normalized cycle position of agent trajectories between 0 and 1, averaged across full benchmark instances (5 runs per instance). For SWE-Bench-Verified (left), Opus 4.6 performs considerable scratch-testing from the beginning and edits peak in the middle, with heavy suite-testing to…
Figure 6: Phase composition surfaces model-specific schedules. Per-cycle share of the four R6 (see LLM judge rubrics in …
Figure 7: Solution-distance curves expose different problem-solving styles between strong models. Mean D(t) vs. normalised cycle position on SWE-Bench-Verified, split by resolved (blue) and unresolved (red) trajectories; shaded bands are ±1σ run-to-run variability. Opus 4.6 descends sharply on resolved runs (around t = 0.5) and plateaus high on unresolved ones, cleanly separating progress from failure. Gemini 3.1 Pr…
Figure 8: Single-edit collapse. A single correct edit can move an agent directly from the initial state to the solution set. Faithful replay of the GPT-5.4 trajectory on astropy__astropy-12907 (SWE-Bench Verified) has only two relevant states: the initial repository state x(t1) before any edit (D=1) and the post-edit state x(t2) after the decisive edit at cycle position t ≈ 0.53 (D=0). Green tiles mark reference pat…
Figure 9: Fixed-reference staircase. Against a fixed reference, solution distance behaves like recall over patch features. The Opus 4.6 trajectory for django__django-11292 is scored against one 8-feature reference held constant across the four panels. Green tiles are reference features reproduced by the live repository state, and gray tiles are missing features. As the agent lands successive hunks, the matched count…
Figure 10: Max-over-modes view. The full metric tracks the nearest empirical solution mode, not a single fixed patch. The same Opus 4.6 run on django__django-11292 is re-rendered with the displayed reference chosen independently at each snapshot by the maximisation in Eq. 5. The trajectory descends 1 → 0.667 → 0.5 → 0 between cycles 0.273 and 0.292, then remains at zero through three later refinements. The best refe…
Figure 11: Found-it-then-lost-it. In this Gemini 3 Flash trajectory on django__django-14140, the first edit exactly reproduces a popular 12-feature rewrite in S̃_i (22 of ∼30 resolved runs, also the dataset gold), so D drops from 1 to 0 at cycle 0.15. The next edit reverts that rewrite and substitutes a multi-condition guard that overlaps only one reference feature, so D jumps back to 0.917 (rounded to 0.9); the up…
Figure 12: Revert as discovery. In this Gemini 3.1 Pro run on astropy-12907 (SWE-Bench-Verified), the agent first reaches the gold-mode fix, deliberately reverts it to test the unfixed baseline, and uncovers a pre-existing unrelated bug. Faithful replay records the productive backtrack (D: 1 → 0 → 1 → 0.5 → 0): the final resolved patch combines the original _cstack fix with a new np.roll(..., axis=0) fix.
Figure 13: SWE-Bench-Verified: total output tokens vs. pass@1, one point per model. Output tokens …
Figure 14: Backtracking ∆D histograms (part 1 of 2): Anthropic Claude and OpenAI families. Each cell is one model, green = progress edits, red = backtracking edits. Annotation shows the per-model backtrack-edit share.
Figure 15: Backtracking ∆D histograms (part 2 of 2). Gemini, Grok and Qwen families. Continued from …
Figure 16: Edit/test ratio per cycle (part 1 of 2): Anthropic Claude and OpenAI families. Each cell …
Figure 17: Edit/test ratio per cycle (part 2 of 2). Gemini, Grok and Qwen families. Continued from …
Figure 18: Mean solution-distance curves D(t) (part 1 of 2): Anthropic Claude and OpenAI families. Blue = resolved trajectories, red = unresolved. Shaded bands show ±1σ run-to-run variability (∼5 runs/instance, averaged across instances).
Figure 19: Mean solution-distance curves (part 2 of 2): Gemini, Grok and Qwen families. Shaded …
Figure 20: Genuine tool-error rate per cycle (part 1 of 2): Anthropic Claude and OpenAI families.
Figure 21: Genuine tool-error rate per cycle (part 2 of 2): Gemini, Grok and Qwen families. Continued …
Figure 22: Phase composition R6 (see Table 4 for LLM judge rubrics) explore …
Figure 23: Phase composition (part 2 of 2): Gemini, Grok and Qwen families. Continued from …
Figure 24: Tool-call distribution per model on SWE-Bench-Verified (part 1 of 2): Anthropic Claude …
Figure 25: Tool-call distribution per model on SWE-Bench-Verified (part 2 of 2): Gemini, Grok and …
Figure 26: reports the same token-vs-pass@1 Pareto view for SWE-Bench-Pro that …
Figure 27: Backtracking ∆D histograms on SWE-Bench-Pro (part 1 of 2): Anthropic Claude and OpenAI families. Each cell is one model; green = progress edits, red = backtracking edits. Annotation shows the per-model backtrack-edit share.
Figure 28: Backtracking ∆D histograms on SWE-Bench-Pro (part 2 of 2): Gemini, Grok and Qwen families. Continued from …
Figure 29: Edit/test ratio per cycle on SWE-Bench-Pro (part 1 of 2): Anthropic Claude and OpenAI …
Figure 30: Edit/test ratio per cycle on SWE-Bench-Pro (part 2 of 2): Gemini, Grok and Qwen families.
Figure 31: Mean solution-distance curves D(t) on SWE-Bench-Pro (part 1 of 2): Anthropic Claude and OpenAI families. Blue = resolved trajectories, red = unresolved. Shaded bands show ±1σ run-to-run variability (∼5 runs/instance, averaged across instances).
Figure 32: Mean solution-distance curves on SWE-Bench-Pro (part 2 of 2): Gemini, Grok and Qwen …
Figure 33: Genuine tool-error rate per cycle on SWE-Bench-Pro (part 1 of 2): Anthropic Claude and …
Figure 34: Genuine tool-error rate per cycle on SWE-Bench-Pro (part 2 of 2): Gemini, Grok and …
Figure 35: Phase composition on SWE-Bench-Pro R6 (see Table 4 for LLM judge rubrics) explore …
Figure 36: Phase composition on SWE-Bench-Pro (part 2 of 2): Gemini, Grok and Qwen families.
Figure 37: Tool-call distribution per model on SWE-Bench-Pro (part 1 of 2): Anthropic Claude and …
Figure 38: Tool-call distribution per model on SWE-Bench-Pro (part 2 of 2): Gemini, Grok and …
Figure 39: Terminal-Bench-2: total output tokens vs. pass@1, one point per model. Output tokens are …
Figure 40: Phase composition on Terminal-Bench-2 R6 (see Table 4 for LLM judge rubrics) explore …
Figure 41: Phase composition on Terminal-Bench-2 (part 2 of 2): Gemini, Grok and Qwen families.
Figure 42: Genuine tool-error rate per cycle on Terminal-Bench-2 (part 1 of 2): Anthropic Claude …
Figure 43: Genuine tool-error rate per cycle on Terminal-Bench-2 (part 2 of 2): Gemini, Grok and …
Figure 44: Edit/test ratio per cycle on Terminal-Bench-2 (part 1 of 2): Anthropic Claude and OpenAI …
Figure 45: Edit/test ratio per cycle on Terminal-Bench-2 (part 2 of 2): Gemini, Grok and Qwen …
Figure 46: Tool-call distribution per model on Terminal-Bench-2 (part 1 of 2): Anthropic Claude and …
Figure 47: Tool-call distribution per model on Terminal-Bench-2 (part 2 of 2): Gemini, Grok and …
Figure 48: SWE-Pro Pass@1 under Base and Sanitized containers. The shaded overlay is the …
Figure 49: SWE-Bench-Pro Pass@1 for Qwen models under Base, Git Instructions, and Sanitized …
Figure 50: A corrupted tool-name causes the model to abandon an intended read (gpt-oss-120b, sympy-13974, SWE-Bench-Verified). The model sets out to read add.py around _eval_power. A stray <|channel|> is folded into the file_read recipient, producing file_read<|channel|>commentary. The harness rejects the call with invalid tool name pattern. Because the injected token is invisible to the model, it concludes that it …
Figure 51: Token-level view of the corrupt-name example in …
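
The captions for Figures 9 to 11 describe solution distance as behaving like recall over patch features, taken against the nearest reference in the solution set. The sketch below is one reading of that description, written as D = 1 − max over references of (matched features / reference size). This formula is an assumption: Eq. 5 of the paper is not reproduced on this page, and the feature names here are placeholders.

```python
def solution_distance(live_features, references):
    """1 minus the best recall of any reference patch (assumed form of D)."""
    best_recall = 0.0
    for reference in references:
        if not reference:
            continue  # skip empty references to avoid division by zero
        recall = len(live_features & reference) / len(reference)
        best_recall = max(best_recall, recall)
    return 1.0 - best_recall


# A 12-feature reference, as in the Figure 11 caption (feature names are made up).
reference = frozenset(f"f{i}" for i in range(12))

d_initial = solution_distance(frozenset(), [reference])       # nothing reproduced
d_exact = solution_distance(reference, [reference])           # exact reproduction
d_one = solution_distance(frozenset({"f0", "guard"}), [reference])  # one overlap
```

Under this reading, overlapping one of twelve reference features gives D = 1 − 1/12 ≈ 0.917, which matches the value quoted in the Figure 11 caption.
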

Original abstract

AI agent performance is not just a modeling problem, it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the 'intent-execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called 'Simple Strands Agent' (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond the pass@1 numbers which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase-transitions reveal how individual models allocate effort across different stages of autonomous problem solving.
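
To make the intent-execution gap concrete, the sketch below shows a harness-side tool-name check that echoes back the payload it actually received, using the corrupted name from the Figure 50 caption. It is an illustration only and not SSA's implementation: the name pattern, the `REGISTERED_TOOLS` set, and the wording of the feedback messages are all assumptions.

```python
import re

# Hypothetical naming rule and tool registry, for illustration only.
TOOL_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
REGISTERED_TOOLS = {"file_read", "shell"}


def check_tool_name(received):
    """Return (ok, feedback); the feedback repeats what the harness received."""
    if not TOOL_NAME_PATTERN.fullmatch(received):
        return False, (
            f"invalid tool name pattern: harness received {received!r}; "
            f"expected one of {sorted(REGISTERED_TOOLS)}"
        )
    if received not in REGISTERED_TOOLS:
        return False, (
            f"tool name not found: {received!r}; "
            f"expected one of {sorted(REGISTERED_TOOLS)}"
        )
    return True, "ok"


ok_clean, _ = check_tool_name("file_read")
ok_corrupt, feedback = check_tool_name("file_read<|channel|>commentary")
```

Because the feedback includes the received string, a model that cannot see the injected token in its own output at least gets to see what the harness was handed. Whether that is enough to close the gap in practice is not something this page establishes.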

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper formalizes the 'intent-execution' gap between model intent and harness execution in AI agents. It introduces the Simple Strands Agent (SSA) harness, reports reproducing or improving pass@1 on SWE-Pro, SWE-Verified, and Terminal-Bench-2 across model families (Claude, Gemini, GPT, Grok, Qwen), and analyzes 138k trajectories represented in code state-spaces to identify model-level differences in finer-grained behaviors including edit frequency, testing activity, and phase transitions.

Significance. If the reported behavioral differences prove robust beyond the specific SSA harness, the work would offer a concrete methodology for moving past aggregate pass@1 metrics to understand how models allocate effort across problem-solving stages, directly supporting harness-model alignment research.

major comments (1)
  1. [Trajectory analysis and experimental setup] The central claim that model-level differences in edit frequency, testing activity, and phase-transitions are observable via code state-spaces rests on trajectories generated exclusively by the single SSA harness (fixed state representation, loop structure, and prompting). Given the paper's own observation of roughly comparable pass@1 across families, systematic harness-model coupling could produce the divergences without reflecting intrinsic strategies; no cross-harness ablation or variation control is described to isolate model effects.
minor comments (1)
  1. [Methods / Results] The abstract provides no detail on trajectory sampling procedure, precise definition of the code state-spaces, or application of multiple-testing correction to the reported metric differences.
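
For context on the multiple-testing point: when many model pairs and metrics are compared, one standard correction is the Holm step-down procedure. The sketch below is a generic implementation with made-up p-values; nothing on this page indicates which correction, if any, the paper applies.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: flag which hypotheses are rejected at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Illustrative p-values for four hypothetical metric comparisons.
decisions = holm_reject([0.001, 0.04, 0.03, 0.20])
```

Here only the first comparison survives correction, although three of the four raw p-values fall below 0.05.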

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on the manuscript. We address the single major comment below.

Point-by-point responses
  1. Referee: [Trajectory analysis and experimental setup] The central claim that model-level differences in edit frequency, testing activity, and phase-transitions are observable via code state-spaces rests on trajectories generated exclusively by the single SSA harness (fixed state representation, loop structure, and prompting). Given the paper's own observation of roughly comparable pass@1 across families, systematic harness-model coupling could produce the divergences without reflecting intrinsic strategies; no cross-harness ablation or variation control is described to isolate model effects.

    Authors: We agree this is a valid concern. The 138k trajectories were generated exclusively with the SSA harness, and no cross-harness ablations or controlled variations in state representation, loop structure, or prompting were performed. SSA was designed as a minimal, general-purpose harness to reduce the intent-execution gap and enable consistent comparison across model families, with the similar pass@1 scores providing some evidence of harness parity. Nevertheless, the possibility of harness-model interactions cannot be ruled out from the current data alone. We will revise the manuscript to explicitly acknowledge this limitation in the experimental setup and trajectory analysis sections and to identify cross-harness validation as an important direction for future work.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical trajectory analysis is self-contained

Full rationale

The paper reports new experimental runs of 138k trajectories on public benchmarks using the SSA harness, followed by direct observation of metrics such as edit frequency and phase transitions. No equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on fresh data collection and comparison across models rather than any reduction of outputs to inputs by construction, satisfying the default expectation of non-circularity for empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical paper; no mathematical axioms or invented entities. The central claims rest on the design choices of the SSA harness and the definition of the trajectory state-space, which function as domain assumptions rather than derived quantities.

pith-pipeline@v0.9.1-grok · 5811 in / 1196 out tokens · 28555 ms · 2026-06-27T01:25:05.359186+00:00 · methodology


Reference graph

Works this paper leans on

92 extracted references · 3 canonical work pages

  1. [1]

    The Amazon Nova family of foundation models

    Amazon AGI. The Amazon Nova family of foundation models. https://aws.amazon.com/ nova/, 2024

  2. [2]

    A general path-based representation for predicting program properties, 2018

    Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. A general path-based representation for predicting program properties, 2018. URLhttps://arxiv.org/abs/1803.09544. 14 Dissecting model behavior through agent trajectories

  3. [3]

    The Claude 4 model family: System cards and capability notes

    Anthropic. The Claude 4 model family: System cards and capability notes. https://www. anthropic.com/claude, 2025

  4. [4]

    Strands Agents: A model-driven SDK for building AI agents.https://strandsagents

    AWS. Strands Agents: A model-driven SDK for building AI agents.https://strandsagents. com, 2025. Open-source SDK

  5. [5]

    Barr, Yuriy Brun, Premkumar Devanbu, Mark Harman, and Federica Sarro

    Earl T. Barr, Yuriy Brun, Premkumar Devanbu, Mark Harman, and Federica Sarro. The plastic surgery hypothesis. InProceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, page 306–317, New York, NY , USA, 2014. Association for Computing Machinery. ISBN 9781450330565. doi: 10.1145/2635868.2635898. URLhttps...

  6. [6]

    LiteLLM: A unified gateway and proxy for LLM APIs

    BerriAI. LiteLLM: A unified gateway and proxy for LLM APIs. https://github.com/ BerriAI/litellm, 2023. Open-source library

  7. [7]

    Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005. 14165

  8. [8]

    Evaluating large language models trained on code, 2021

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, et al. Evaluating large language models trained on code, 2021. URL https: //arxiv.org/abs/2107.03374

  9. [9]

    Gemini 3: Pro and Flash

    Google DeepMind. Gemini 3: Pro and Flash. https://deepmind.google/technologies/ gemini/, 2025

  10. [10]

    Deepseek-v3 technical report, 2025

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, et al. Deepseek-v3 technical report, 2025. URLhttps://arxiv.org/abs/2412.19437

  11. [11]

    SWE- Bench Pro: Can AI agents solve long-horizon software engineering tasks?, 2025

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, et al. SWE- Bench Pro: Can AI agents solve long-horizon software engineering tasks?, 2025. URL https: //arxiv.org/abs/2509.16941

  12. [12]

    Agentic RL: Token-in, token-out done right, 2026

    Quentin Gallouédec and Kashif Rasul. Agentic RL: Token-in, token-out done right, 2026. Accessed 2026-06-07

  13. [13]

    GraphCodeBERT: Pre-training code representations with data flow, 2021

    Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, et al. GraphCodeBERT: Pre-training code representations with data flow, 2021. URLhttps://arxiv.org/abs/2009. 08366

  14. [14]

    Deepseek- r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638,

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, et al. Deepseek- r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638,

  15. [15]

    Deng, J., Li, T.-W., Zhang, S., Liu, S., Pan, Y ., Huang, H., Wang, X., Hu, P., Zhang, X., et al

    ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx.doi.org/10. 1038/s41586-025-09422-z

  16. [16]

    Qwen2.5-coder technical report, 2024

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, et al. Qwen2.5-coder technical report, 2024. URLhttps://arxiv.org/abs/2409.12186

  17. [17]

    LiveCodeBench: Holistic and contami- nation free evaluation of large language models for code, 2024

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contami- nation free evaluation of large language models for code, 2024. URL https://arxiv.org/ abs/2403.07974

  18. [18]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URLhttps://arxiv.org/abs/2310.06770

  19. [19]

    CODESTRUCT: Code agents over structured action spaces, 2026

    Myeongsoo Kim, Joe Hsu, Dingmin Wang, Shweta Garg, Varun Kumar, and Murali Krishna Ramanathan. CODESTRUCT: Code agents over structured action spaces, 2026. URL https: //arxiv.org/abs/2604.05407

  20. [20]

    Large language models are zero-shot reasoners, 2023

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023. URL https://arxiv.org/abs/2205.11916. 15 Dissecting model behavior through agent trajectories

  21. [21]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention, 2023. URL https://arxiv.org/abs/2309. 06180

  22. [22]

    A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each

    Claire Le Goues, Michael Dewey-V ogt, Stephanie Forrest, and Westley Weimer. A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each. InProceedings of the 34th International Conference on Software Engineering, ICSE ’12, page 3–13. IEEE Press,

  23. [23]

    Automated program repair

    Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. Automated program repair. Commun. ACM, 62(12):56–65, November 2019. ISSN 0001-0782. doi: 10.1145/3318162. URL https://doi.org/10.1145/3318162

  24. [24]

    Repobench: Benchmarking repository-level code auto-completion systems, 2023

    Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems, 2023. URLhttps://arxiv.org/abs/2306.03091

  25. [25]

    Agentbench: Evaluating llms as agents, 2025

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, et al. Agentbench: Evaluating llms as agents, 2025. URLhttps://arxiv.org/abs/2308.03688

  26. [26]

    An analysis of the search spaces for generate and validate patch generation systems, 2016

    Fan Long and Martin Rinard. An analysis of the search spaces for generate and validate patch generation systems, 2016. URLhttps://arxiv.org/abs/1602.05643

  27. [27]

    GPT-4.1 prompting guide

    Noah MacCallum and Julian Lee. GPT-4.1 prompting guide. https://cookbook.openai. com/examples/gpt4-1_prompting_guide/, 2025. OpenAI Cookbook

  28. [28]

    Data contamination: From memorization to exploitation, 2022

    Inbal Magar and Roy Schwartz. Data contamination: From memorization to exploitation, 2022. URLhttps://arxiv.org/abs/2203.08242

  29. [29]

    Merrill, Alexander G

    Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026. URLhttps://arxiv.org/abs/2601.11868

  30. [30]

    MiniMax-M2: A foundation model with extended context and tool use

    MiniMax AI. MiniMax-M2: A foundation model with extended context and tool use. https: //huggingface.co/MiniMaxAI, 2025

  31. [31]

    Alberto Moraglio, Krzysztof Krawiec, and Colin G. Johnson. Geometric semantic genetic programming. InParallel Problem Solving from Nature - PPSN XII, pages 21–31, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-32937-1

  32. [32]

    The GPT-5 model family and the gpt-oss open-weights release

    OpenAI. The GPT-5 model family and the gpt-oss open-weights release. https://openai. com/, 2025

  33. [33]

    openai-harmony: Response format for the gpt-oss models

    OpenAI. openai-harmony: Response format for the gpt-oss models. https://github.com/ openai/harmony, 2025. Accessed 2026-06-07

  34. [34]

    ToolLLM: Facilitating large language models to master 16000+ real-world apis, 2023

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, et al. ToolLLM: Facilitating large language models to master 16000+ real-world apis, 2023. URLhttps://arxiv.org/abs/2307.16789

  35. [35]

    Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents, 2025

    Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, et al. Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents, 2025. URL https://arxiv.org/abs/2504. 08703

  36. [36]

    Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark, 2023

    Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark, 2023. URLhttps://arxiv.org/abs/2310.18018

  37. [37]

    Toolformer: Language models can teach themselves to use tools, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettle- moyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023. URLhttps://arxiv.org/abs/2302.04761

  38. [38]

    Building effective agents

    Erik Schluntz and Barry Zhang. Building effective agents. https://www.anthropic.com/ research/building-effective-agents, 2024. Anthropic engineering blog. 16 Dissecting model behavior through agent trajectories

  39. [39]

    Reflexion: Language agents with verbal reinforcement learning, 2023

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366

  40. [40]

    Gradient-based program repair: Fixing bugs in continuous program spaces, 2026

    André Silva, Gustav Thorén, and Martin Monperrus. Gradient-based program repair: Fixing bugs in continuous program spaces, 2026. URL https://arxiv.org/abs/2505.17703

  41. [41]

    A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), 2024

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), 2024. URL https://arxiv.org/abs/2308.11432

  42. [42]

    Executable code actions elicit better LLM agents. arXiv preprint arXiv:2402.01030, 2024

    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM agents. arXiv preprint arXiv:2402.01030, 2024. URL https://arxiv.org/abs/2402.01030

  43. [43]

    Openhands: An open platform for ai software developers as generalist agents, 2025

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, et al. Openhands: An open platform for ai software developers as generalist agents, 2025. URL https://arxiv.org/abs/2407.16741

  44. [44]

    Emergent abilities of large language models, 2022

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, et al. Emergent abilities of large language models, 2022. URL https://arxiv.org/abs/2206.07682

  45. [45]

    Chain-of-thought prompting elicits reasoning in large language models, 2023

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

  46. [46]

    AutoGen: Enabling next-gen llm applications via multi-agent conversation, 2023

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, et al. AutoGen: Enabling next-gen llm applications via multi-agent conversation, 2023. URL https://arxiv.org/abs/2308.08155

  47. [47]

    Grok 4.20 reasoning model. https://x.ai/, 2025

    xAI. Grok 4.20 reasoning model. https://x.ai/, 2025

  48. [48]

    The rise and potential of large language model based agents: A survey, 2023

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, et al. The rise and potential of large language model based agents: A survey, 2023. URL https://arxiv.org/abs/2309.07864

  49. [49]

    Agentless: Demystifying llm-based software engineering agents, 2024

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents, 2024. URL https://arxiv.org/abs/2407.01489

  50. [50]

    OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. URL https://arxiv.org/abs/2404.07972

  51. [51]

    Hydra – a framework for elegantly configuring complex applications

    Omry Yadan. Hydra – a framework for elegantly configuring complex applications. https://github.com/facebookresearch/hydra, 2019. GitHub repository

  52. [52]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

  53. [53]

    SWE-agent: Agent-computer interfaces enable automated software engineering, 2024

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering, 2024. URL https://arxiv.org/abs/2405.15793

  54. [54]

    Swe-bench multimodal: Do ai systems generalize to visual software domains?, 2024

    John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, and Ofir Press. Swe-bench multimodal: Do ai systems generalize to visual software domains?, 2024. URL https://arxiv.org/abs/2410.03859

  55. [55]

    React: Synergizing reasoning and acting in language models, 2023

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023. URL https://arxiv.org/abs/2210.03629.

  56. [56]

    Autocoderover: Autonomous program improvement, 2024

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement, 2024. URL https://arxiv.org/abs/2404.05427

  57. [57]

    Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/abs/2306.05685

  58. [58]

    Webarena: A realistic web environment for building autonomous agents, 2024

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/2307.13854.

    Appendix Contents The appendi...

  59. [59]

    Forced tool-use output. The judge cannot reply with free-form text. Its only allowed output is a single call to the submit_classifications tool whose schema only accepts an enum. Outputs are validated client-side before acceptance.

    A.1.2 Classification rubric

    Table 4 lists every field the judge assigns to a call. R1–R5 carry the bulk of the behavioural signa...

  60. [60]

    fraction-of-fix-achieved

    Textual overlap, not semantics. The recall fraction in Eq. 6 counts matching changed lines. A correct-but-textually-different fix (e.g., a different identifier choice, a refactored expression, or a guard placed elsewhere) scores <1 against reference modes it does not textually match. The empirical subsets and the self-anchor for resolved endpoints mitigate thi...

  61. [61]

    Empirical subset ≠ solution space. The space is defined by the test oracle, and $\tilde{S}_i$ approximates it with the patches we happened to observe (using 21 models run 5 times). A unique correct solution found by nobody else, a real possibility on novel instances, can be far from the observed empirical modes, which is why the self-anchor is reserved for resolved ...

  62. [62]

    $\tilde{S}_i$ is dense on easy instances (for example, in SWE-Bench-Verified, 80+ reference modes) and sparse on hard ones

    Reference-set sparsity scales with difficulty. $\tilde{S}_i$ is dense on easy instances (for example, in SWE-Bench-Verified, 80+ reference modes) and sparse on hard ones. Instances that no model in our sweep solved have only the gold patch as a reference (or no reference if gold is also missing). D on these instances reads against a thin reference set and should be in...

  63. [63]

    Judge R#

    Replay fidelity. d(t) is reconstructed from the edit-tool calls we parse (Table 5). Therefore, exotic shell rewrites (e.g. Python scripts that open a source file in write mode) are not fully parsed. Self-anchor fixes the resolved endpoint, but it cannot recover the exact intermediate state of an unparsed edit, so a small number of trajectories’ mid-run sh...

  64. [68]

    User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a python code repository in the directory {{project_path}} (not in /tmp/inputs)

    Think about edgecases and make sure your fix handles them as well Your thinking should be thorough and so it’s fine if it’s very long. User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a python code repository in the directory {{project_path}} (not in /tmp/inputs). Consider the following PR description: <pr_description> {{git_issue}}...

  65. [73]

    User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a code repository in the directory {{project_path}} (not in /tmp/inputs)

    Think about edgecases and make sure your fix handles them as well Your thinking should be thorough and so it’s fine if it’s very long. User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a code repository in the directory {{project_path}} (not in /tmp/inputs). Consider the following PR description: <pr_description> {{git_issue}} </pr_d...

  66. [75]

    Create a script to reproduce the error and execute it using the BashTool, to confirm the error

  67. [78]

    If any test fails, diagnose the failure and fix your implementation

    Run the existing test suite for the affected module. If any test fails, diagnose the failure and fix your implementation

  68. [79]

    IMPORTANT rules: - You MUST run the tests frequently, and verify correctness of changes by running relevant tests

    Think about edgecases and make sure your fix handles them as well. IMPORTANT rules: - You MUST run the tests frequently, and verify correctness of changes by running relevant tests. If tests fail, analyze failures and revise your patch. - Failing to test sufficiently rigorously is the NUMBER ONE failure mode. - There are hidden tests beyond what is visibl...

  69. [87]

    - If any test fails, diagnose the failure, revise your fix, and rerun until they all pass

    Find and run the repository’s own existing tests for the files and functions you modified (e.g., ‘pytest path/to/test_file.py‘, the project’s tests, etc.). - If any test fails, diagnose the failure, revise your fix, and rerun until they all pass. - Do not finish until the relevant repo tests pass. Your thinking should be thorough and so it’s fine if it’s ...

  70. [88]

    Before exploring anything, use the ‘think‘ tool to write up: - the task restated in your own words - 3-5 hypotheses for the root cause, ranked by likelihood

  71. [89]

    Explore the repo to familiarize yourself with its structure

  72. [91]

    Use the ‘think‘ tool to list 2-3 candidate fixes in 1-2 lines each, then pick the simplest one

  73. [93]

    Rerun your reproduce script and confirm that the error is fixed

  74. [94]

    Use the ‘think‘ tool to enumerate 3-5 edge cases for the changed code, then exercise each via the reproduction script or shell

  75. [95]

    - If any test fails, diagnose the failure, revise your fix, and rerun until they all pass

    Find and run the repository’s own existing tests for the files and functions you modified (e.g., ‘pytest path/to/test_file.py‘, the project’s tests, etc.). - If any test fails, diagnose the failure, revise your fix, and rerun until they all pass. - Do not finish until the relevant repo tests pass. Your thinking should be thorough and so it’s fine if it’s ...

  76. [100]

    ideally more than 100 times

    Think about edgecases and make sure your fix handles them as well Your thinking should be thorough and so it’s fine if it’s very long. In this environment, you can run ‘<apply_patch_command>‘ to execute a diff/patch against a file, where <apply_patch_command> is a specially formatted apply patch command representing the diff you wish to execute. A valid <...

  77. [102]

    Create a script to reproduce the error and execute it with ‘python <filename.py>‘ using the BashTool, to confirm the error

  78. [103]

    Edit the sourcecode of the repo to resolve the issue

  79. [104]

    Rerun your reproduce script and confirm that the error is fixed!

  80. [105]

    User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a python code repository in the directory {{project_path}} (not in /tmp/inputs)

    Think about edgecases and make sure your fix handles them as well Your thinking should be thorough and so it’s fine if it’s very long. User: <uploaded_files> {{project_path}} </uploaded_files> I’ve uploaded a python code repository in the directory {{project_path}} (not in /tmp/inputs). Consider the following PR description: <pr_description> {{git_issue}}...
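The "textual overlap, not semantics" caveat in the appendix notes above can be made concrete with a small sketch. It assumes, hypothetically, that the recall fraction is the share of the reference patch's changed lines that also appear in the candidate patch; the exact form of Eq. 6 is not reproduced here, and the names `changed_lines`, `recall_fraction`, `REFERENCE`, and `EQUIVALENT` are illustrative only.

```python
def changed_lines(diff_text):
    """Collect (sign, stripped content) pairs for added/removed lines of a unified diff.

    Simplification: any line starting with '+++' or '---' is treated as a file header.
    """
    lines = set()
    for raw in diff_text.splitlines():
        if raw.startswith(("+++", "---")):
            continue  # skip file headers
        if raw.startswith(("+", "-")):
            content = raw[1:].strip()
            if content:
                lines.add((raw[0], content))
    return lines


def recall_fraction(candidate_diff, reference_diff):
    """Assumed recall: fraction of reference changed lines matched textually by the candidate."""
    ref = changed_lines(reference_diff)
    if not ref:
        return 0.0
    return len(ref & changed_lines(candidate_diff)) / len(ref)


# Hypothetical reference patch: guard against division by zero.
REFERENCE = """--- a/stats.py
+++ b/stats.py
@@ -1,2 +1,4 @@
 def mean(total, count):
+    if count == 0:
+        return 0.0
     return total / count
"""

# Same guard for integer counts, written with a different expression.
EQUIVALENT = """--- a/stats.py
+++ b/stats.py
@@ -1,2 +1,4 @@
 def mean(total, count):
+    if not count:
+        return 0.0
     return total / count
"""

print(recall_fraction(REFERENCE, REFERENCE))   # identical patch
print(recall_fraction(EQUIVALENT, REFERENCE))  # equivalent guard, different text
```

Under this assumed definition the identical patch scores 1.0, while the equivalent guard matches only one of the two reference lines and scores 0.5. That is the kind of under-scoring the caveat describes.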
