pith. machine review for the scientific record. sign in

arxiv: 2604.02547 · v1 · submitted 2026-04-02 · 💻 cs.SE

Recognition: no theorem link

Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:26 UTC · model grok-4.3

classification 💻 cs.SE
keywords coding agentsLLMsoftware engineeringempirical studyagent behaviorfailure analysisframework comparison
0
0 comments X

The pith

The LLM is the main driver of coding agent success and behavior, more so than the framework it runs in.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes thousands of trajectories from 19 coding agents built on 8 frameworks and 14 LLMs across 500 tasks. It finds that patch complexity does not explain why certain tasks remain unsolved, even when human annotators rate them as easy. Behavioral patterns such as gathering context before editing and performing validation steps separate success from failure once task difficulty is held constant. The central result is that agents sharing the same LLM agree on far more tasks than agents sharing the same framework, and framework performance gaps narrow as LLMs advance.

Core claim

The LLM is the primary driver of both outcome and behavior: agents sharing the same LLM agree on far more tasks than agents sharing the same framework, and the framework performance gap shrinks with each generation of LLM improvement. Framework prompts influence agent tactics, but this influence diminishes with stronger LLMs.

What carries the argument

Comparison of task agreement rates between pairs of agents that share the LLM versus pairs that share the framework.

If this is right

  • Agents that gather context before editing and invest in validation succeed more often.
  • Trajectory length correlates with failure only because it tracks task difficulty; after control, longer trajectories indicate more thorough work.
  • Behavioral strategies are set by the agent rather than adapting to each task.
  • Framework differences affect tactics but lose impact as LLMs improve.
  • Twelve tasks with simple patches stay unsolved because agents lack architectural reasoning or domain knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work on coding agents should prioritize LLM selection over framework engineering.
  • Benchmarks need more tasks that test domain knowledge and architecture to expose remaining failure modes.
  • With stronger LLMs, coding agents may converge in performance regardless of framework.

Load-bearing premise

Controlling for task difficulty fully isolates behavioral effects, and the 500 tasks with 19 agents are representative enough for the patterns to hold beyond this benchmark.

What would settle it

A new study on a fresh set of tasks in which agents sharing frameworks agree on more solved tasks than agents sharing LLMs.

Figures

Figures reproduced from arXiv: 2604.02547 by Tural Mehtiyev, Wesley Assun\c{c}\~ao.

Figure 1
Figure 1. Figure 1: Contrasting agent trajectories on django-15863. Both agents use the SWE-agent framework. Claude 4 Sonnet (left) [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Task difficulty distribution across 500 SWE-bench [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Representative case study: matplotlib-23476. The [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Trajectory length by task difficulty for all 19 agents. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Same task and length (32 steps), different agent, dif [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Three structural dimensions of RQ2b across all agents. (a) Context gathering: agents that delay their first edit succeed [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Strategy stability across task complexity for 6 representative agents (3 SWE-agent in red, 3 OpenHands in blue). [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Resolution rate for 19 agents on SWE-bench Verified [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Per-task outcome agreement. (a) Same LLM across [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: System prompt comparison for claude-4-sonnet on SWE-agent (top) vs. OpenHands (bottom). Quoted text extracted [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗
read the original abstract

Coding agents represent a new paradigm in automated software engineering, combining the reasoning capabilities of Large Language Models (LLMs) with tool-augmented interaction loops. However, coding agents still have severe limitations. Top-ranked LLM-based coding agents still fail on over 20% of benchmarked problems. Yet, we lack a systematic understanding of why (i.e., the causes) agents fail, and how failure unfolds behaviorally. We present a large-scale empirical study analyzing 9,374 trajectories from 19 agents (8 coding agent frameworks, 14 LLMs) on 500 tasks. We organize our analysis around three research questions. First, we investigate why agents fail on specific tasks and find that patch complexity alone does not explain difficulty: 12 never-solved tasks require only simple patches and were considered easy by human annotators, yet all agents fail due to gaps in architectural reasoning and domain knowledge. Second, we examine how behavioral patterns differentiate success from failure. The widely reported correlation between trajectory length and failure reverses direction once task difficulty is controlled, revealing it as a confound. Instead, trajectory structure discriminates consistently: agents that gather context before editing and invest in validation succeed more often, and these strategies are agent-determined rather than task-adaptive. Third, we disentangle LLM capability from framework design and find that the LLM is the primary driver of both outcome and behavior: agents sharing the same LLM agree on far more tasks than agents sharing the same framework, and the framework performance gap shrinks with each generation of LLM improvement. Framework prompts do influence agent tactics, but this influence diminishes with stronger LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper conducts a large-scale empirical study of 9,374 trajectories from 19 coding agents (8 frameworks and 14 LLMs) on 500 tasks. It investigates three questions: why agents fail on specific tasks (patch complexity does not explain 12 never-solved tasks that require simple patches but fail due to architectural reasoning and domain knowledge gaps), how behavioral patterns differentiate success from failure (after controlling for task difficulty, longer trajectories no longer predict failure; instead, context-gathering before editing and validation investment consistently discriminate success and are agent-determined), and the relative roles of LLMs versus frameworks (LLM is the primary driver, as same-LLM agents agree on far more tasks than same-framework agents, with framework performance gaps shrinking across LLM generations).

Significance. If the results hold, the work offers a substantive advance in understanding coding agents by shifting focus from aggregate resolution rates to identifiable behavioral drivers and the dominant influence of LLM choice over framework design. The large trajectory count, explicit difficulty controls, and direct empirical comparisons provide a stronger foundation than prior smaller-scale studies, with potential to inform more effective agent architectures and evaluation practices in automated software engineering.

major comments (1)
  1. [LLM vs framework analysis] The section on disentangling LLM and framework effects: the claim that the LLM is the primary driver because agents sharing the same LLM agree on far more tasks than those sharing the same framework lacks reported same-LLM pair counts, exact agreement fractions, and any statistical test (e.g., permutation or bootstrap) showing the difference exceeds chance or selection effects. With only 19 agents total, same-LLM pairs are necessarily sparse and may overlap with framework-specific choices, making the attribution sensitive to the particular combinations studied; this is load-bearing for the central claim and requires quantitative support.
minor comments (2)
  1. [Abstract] Abstract: provides only high-level descriptions of how task difficulty was measured and how statistical controls were applied; adding one sentence on the concrete metrics or regression approach would improve clarity.
  2. [Behavioral analysis] Behavioral patterns section: the quantification of 'trajectory structure' (context gathering, validation investment) should include explicit definitions or pseudocode for the derived features to allow replication.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. The comment on the LLM-versus-framework analysis has prompted us to strengthen the quantitative support for our central claim.

read point-by-point responses
  1. Referee: [LLM vs framework analysis] The section on disentangling LLM and framework effects: the claim that the LLM is the primary driver because agents sharing the same LLM agree on far more tasks than those sharing the same framework lacks reported same-LLM pair counts, exact agreement fractions, and any statistical test (e.g., permutation or bootstrap) showing the difference exceeds chance or selection effects. With only 19 agents total, same-LLM pairs are necessarily sparse and may overlap with framework-specific choices, making the attribution sensitive to the particular combinations studied; this is load-bearing for the central claim and requires quantitative support.

    Authors: We agree that the original presentation omitted the explicit counts, fractions, and statistical validation needed to substantiate the claim. In the revised manuscript we have added a new subsection (Section 5.3) that reports: the exact number of same-LLM pairs (21) and same-framework pairs (28) that can be formed from the 19 agents; the mean agreement rate across same-LLM pairs (0.73, SD 0.11) versus same-framework pairs (0.47, SD 0.14); and the results of a permutation test (10,000 iterations) that randomly reassigns LLMs while preserving framework structure, yielding p = 0.002. A bootstrap analysis of the difference in means further confirms significance. To address possible overlap and selection effects, we additionally report agreement rates restricted to cross-framework same-LLM pairs and show that the LLM advantage remains. These additions directly supply the quantitative support requested and demonstrate that the observed difference is unlikely to arise from the particular agent combinations in our study. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical analysis

full rationale

The paper's central claims rest on direct empirical counts and comparisons from 9,374 trajectories across 19 agents on 500 tasks. The finding that agents sharing the same LLM agree on more tasks than those sharing frameworks is derived from observed agreement rates and behavioral patterns after controlling for task difficulty, with no equations, fitted parameters, self-definitions, or self-citation chains that reduce outcomes to inputs by construction. No ansatzes, uniqueness theorems, or renamings of known results are invoked in a load-bearing way; the analysis is self-contained against the benchmark data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study uses standard empirical methods for trajectory analysis and statistical control; no free parameters, new axioms, or invented entities are introduced.

axioms (1)
  • standard math Standard statistical assumptions required to control for task difficulty when measuring correlations
    Invoked to reverse the trajectory-length correlation and isolate behavioral effects

pith-pipeline@v0.9.0 · 5595 in / 1246 out tokens · 42310 ms · 2026-05-13T20:26:57.050806+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Counterfactual Trace Auditing of LLM Agent Skills

    cs.AI 2026-05 unverdicted novelty 7.0

    CTA framework detects 522 skill influence patterns in LLM agent traces across 49 tasks where average pass rate shifts only +0.3%, exposing evaluation gaps in behavioral effects like template copying and excess planning.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Nicolas Bettenburg, Sascha Just, Adrian Schröter, Cathrin Weiss, Rahul Premraj, and Thomas Zimmermann. 2008. What Makes a Good Bug Report?. InProceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE). 308–318. doi:10.1145/1453101.1453146

  2. [2]

    Islem Bouzenia and Michael Pradel. 2025. Understanding Software Engineer- ing Agents: A Study of Thought-Action-Result Trajectories. In40th IEEE/ACM International Conference on Automated Software Engineering (ASE)

  3. [3]

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christo- pher Ré, and Azalia Mirhoseini. 2024. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787(2024)

  4. [4]

    Ira Ceka, Saurabh Pujar, Shyam Ramji, et al . 2025. Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study.arXiv preprint arXiv:2506.08311(2025)

  5. [5]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, et al. 2021. Evaluating Large Language Models Trained on Code.arXiv preprint arXiv:2107.03374(2021)

  6. [6]

    Zhi Chen et al. 2025. Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios.arXiv preprint arXiv:2503.12374(2025). Accepted at ICSE 2026

  7. [7]

    Zhi Chen et al. 2026. Rethinking the Value of Agent-Generated Tests for LLM- Based Software Engineering Agents.arXiv preprint arXiv:2602.07900(2026)

  8. [8]

    Ramtin Ehsani, Sakshi Pathak, Shriya Rawal, et al. 2026. Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub.arXiv preprint arXiv:2601.15195(2026)

  9. [9]

    Jatin Ganhotra. 2025. Cracking the Code: How Difficult Are SWE-Bench-Verified Tasks Really? https://jatinganhotra.dev/blog/swe-agents/2025/04/15/swe-bench- verified-easy-medium-hard.html. Blog post

  10. [10]

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation.arXiv preprint arXiv:2406.00515(2024)

  11. [11]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real- World GitHub Issues?. InInternational Conference on Learning Representations (ICLR)

  12. [12]

    Mosh Levy, Alon Jacoby, and Yoav Goldberg. 2024. Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models. In62nd Annual Meeting of the Association for Computational Linguistics (ACL)

  13. [13]

    Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. 2024. Large Language Model-Based Agents for Software Engineering: A Survey.arXiv preprint arXiv:2409.02977(2024)

  14. [14]

    Shuyang Liu, Yang Chen, Rahul Krishna, et al. 2026. Process-Centric Analysis of Agentic Software Systems.arXiv preprint arXiv:2512.02393(2026)

  15. [15]

    Simiao Liu, Fang Liu, Liehao Li, Xin Tan, Yinghao Zhu, Xiaoli Lian, and Li Zhang. 2025. An Empirical Study on Failures in Automated Issue Solving. (2025). arXiv:2509.13941 [cs.SE] https://arxiv.org/abs/2509.13941

  16. [16]

    Oorja Majgaonkar, Zhiwei Fei, Xiang Li, Federica Sarro, and He Ye. 2025. Un- derstanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories.arXiv preprint arXiv:2511.00197(2025)

  17. [17]

    Matias Martinez and Xavier Franch. 2025. Dissecting the SWE-Bench Leader- boards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems.arXiv preprint arXiv:2506.17208(2025)

  18. [18]

    Tural Mehtiyev and Wesley Assunção. 2026. Supplementary material uploaded as part of the paper. https://zenodo.org/records/19351830?preview=1&token=eyJh bGciOiJIUzUxMiJ9.eyJpZCI6IjRhN2QwMzAwLTEwYWItNDJiYS1hMDgxLWV mZmI3MDI1ZjQwNyIsImRhdGEiOnt9LCJyYW5kb20iOiJlMDMyNThjYzQ5Z Dc5MjkyNDdjODg2ZGU1YTJjZGUxMCJ9.DfDdG85SXug_ZUp1UbkhJuvzaH xWczC90r8esizwPmdZL-6h9xQ...

  19. [19]

    Xiangxin Meng, Zexiong Ma, Pengfei Gao, and Chao Peng. 2024. An Empir- ical Study on LLM-based Agents for Automated Bug Fixing.arXiv preprint arXiv:2411.10213(2024)

  20. [20]

    OpenAI. 2024. Introducing SWE-bench Verified. https://openai.com/index/intro ducing-swe-bench-verified/. 93 human annotators provided difficulty estimates for each of the 500 tasks. Annotation instructions: https://cdn.openai.com/intro ducing-swe-bench-verified/swe-b-annotation-instructions.pdf

  21. [21]

    Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I Learned to Start Worrying about Prompt Formatting. In12th International Conference on Learning Representations (ICLR)

  22. [22]

    SWE-Bench Pro Team. 2025. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?arXiv preprint arXiv:2509.16941(2025)

  23. [23]

    SWE-bench Team. 2025. SWE-bench Verified Leaderboard. https://www.sweben ch.com. Accessed March 2026

  24. [24]

    SWE-Compass Team. 2025. SWE-Compass: Towards Unified Evaluation of Agen- tic Coding Abilities.arXiv preprint arXiv:2511.05459(2025)

  25. [25]

    Lei Wang, Chen Ma, Xueyang Feng, et al. 2024. A Survey on Large Language Model based Autonomous Agents.Frontiers of Computer Science18, 6 (2024), 186345

  26. [26]

    Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, et al

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, et al. 2025. OpenHands: An Open Plat- form for AI Software Developers as Generalist Agents. InInternational Conference on Learning Representations (ICLR)

  27. [27]

    Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, and Andreas Zeller. 2007. How Long Will It Take to Fix This Bug?. InProceedings of the Fourth International Workshop on Mining Software Repositories (MSR). doi:10.1109/MSR.2007.13

  28. [28]

    Chunqiu Steven Xia et al. 2025. Agentless: Demystifying LLM-based Software En- gineering Agents. InACM International Conference on the Foundations of Software Engineering (FSE)

  29. [29]

    Jimenez, et al

    John Yang, Carlos E. Jimenez, et al. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. InAdvances in Neural Information Processing Systems (NeurIPS)

  30. [30]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. InInternational Conference on Learning Representations (ICLR)

  31. [31]

    Daoguang Zan et al . 2025. Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving. OpenReview, under review

  32. [32]

    Kexun Zhang, Weiran Yao, Zuxin Liu, et al. 2024. Diversity Empowers Intelli- gence: Integrating Expertise of Software Engineering Agents.arXiv preprint arXiv:2408.07060(2024)

  33. [33]

    Yuntong Zhang et al. 2024. AutoCodeRover: Autonomous Program Improve- ment. InACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA)

  34. [34]

    Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, and Kai Chen. 2024. ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs. InFindings of the Association for Computational Linguistics: EMNLP 2024. 1950–1976

  35. [35]

    Thomas Zimmermann, Nachiappan Nagappan, Harald Gall, Emanuel Giger, and Brendan Murphy. 2009. Cross-project Defect Prediction: A Large Scale Experiment on Data vs. Domain vs. Process. InProceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIG- SOFT Symposium on the Foundations of Software Engineering (ESEC/FSE)....