pith. machine review for the scientific record.

arxiv: 2605.12270 · v1 · submitted 2026-05-12 · 💻 cs.SE

Recognition: no theorem link

Characterizing the Failure Modes of LLMs in Resolving Real-World GitHub Issues

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 03:46 UTC · model grok-4.3

classification 💻 cs.SE
keywords: LLM failure modes · GitHub issue resolution · automated program repair · repair pipeline stages · strategy formulation · fault localization · evaluation harness

The pith

LLMs resolving real GitHub issues fail most often at strategy formulation and logic synthesis, while performing best at fault localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates three leading large language models on resolving real-world GitHub issues and manually examines 243 failed repair attempts to build a five-stage taxonomy of the repair process. It measures where each stage breaks down and finds that strategy formulation and logic synthesis produces the most errors for every model, with problem understanding as the second most common failure point. Localization shows the lowest failure rate, suggesting LLMs handle this traditionally difficult task more reliably than expected. The study also documents differences in model robustness and operational costs during failures, and notes that some existing evaluation setups misclassify correct patches due to superficial issues. From these patterns the authors derive root causes and practical mitigation steps for improving LLM-based repairs.

Core claim

Across the evaluated models, strategy formulation and logic synthesis constitutes the most error-prone stage in the repair pipeline, followed by problem understanding, whereas localization exhibits the lowest failure rate. This indicates that LLMs may excel at fault localization, a task traditionally regarded as one of the most formidable challenges in automated program repair. The analysis further reveals that evaluation harnesses occasionally misjudge correct patches due to superficial discrepancies or hidden constraints.

What carries the argument

A unified five-stage taxonomy of the repair pipeline that categorizes failure symptoms and root causes across problem understanding, localization, strategy formulation and logic synthesis, and other stages.
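
To make the taxonomy's role concrete, here is a minimal sketch of how stage labels over failed attempts yield the prevalence ordering the paper reports. The stage names follow the paper's five phases; the example records and the tallying code are hypothetical illustrations, not the authors' analysis scripts.

    from collections import Counter
    from enum import Enum

    class Stage(Enum):
        """The paper's five repair-pipeline stages."""
        PROBLEM_UNDERSTANDING = "problem understanding"
        LOCALIZATION = "localization"
        STRATEGY_AND_LOGIC = "strategy formulation and logic synthesis"
        PATCH_GENERATION = "patch generation"
        VALIDATION = "validation"

    # Hypothetical failure records: (task id, model, earliest failing stage).
    # The real labels come from the authors' manual analysis of 243 failed attempts.
    failures = [
        ("task-001", "model-a", Stage.STRATEGY_AND_LOGIC),
        ("task-002", "model-b", Stage.PROBLEM_UNDERSTANDING),
        ("task-003", "model-c", Stage.LOCALIZATION),
    ]

    def stage_prevalence(records):
        """Share of failures attributed to each stage (single-label attribution)."""
        counts = Counter(stage for _, _, stage in records)
        total = sum(counts.values())
        return {stage: counts[stage] / total for stage in Stage}

    for stage, share in sorted(stage_prevalence(failures).items(),
                               key=lambda kv: kv[1], reverse=True):
        print(f"{stage.value}: {share:.1%}")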

If this is right

  • LLMs show relative strength at localizing faults within codebases compared with other repair stages.
  • Future LLM improvements for issue resolution should prioritize better strategy planning and logic synthesis capabilities.
  • Model selection matters for balancing success rates against robustness and cost when failures occur.
  • Existing evaluation harnesses require refinement to reduce false negatives on patches that are functionally correct.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems that combine LLMs with dedicated localization tools could leverage the observed strength in that stage.
  • The stage-wise failure pattern may appear in other code-reasoning tasks beyond GitHub issue resolution.
  • Explicit strategy-planning prompts or intermediate reasoning checkpoints could reduce the dominant error type.
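
One way to act on the last suggestion is an agent loop that forces an explicit, reviewable plan before any edit is attempted. This is an editorial sketch under stated assumptions: the prompt wording, the plan_then_patch helper, and the call_llm interface are hypothetical, not something the paper implements.

    # Hypothetical scaffold: a strategy-formulation checkpoint between
    # localization and patch generation. call_llm stands in for any chat API.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError("wire up a model client here")

    PLAN_PROMPT = """You are resolving the GitHub issue below.
    Before writing any code, output a numbered repair plan that
    1. restates the root cause in one sentence,
    2. lists every file and function the fix must touch,
    3. states the invariant the patch must preserve.
    Issue:
    {issue}
    Suspect locations (from a separate localization pass):
    {locations}"""

    def plan_then_patch(issue: str, locations: list[str]) -> str:
        plan = call_llm(PLAN_PROMPT.format(issue=issue, locations="\n".join(locations)))
        # Checkpoint: a human or a second model can reject the plan here,
        # catching strategy errors before a patch is ever generated.
        return call_llm(f"Implement exactly this plan as a unified diff:\n{plan}")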

Load-bearing premise

The manual categorization of the 243 failures into the five-stage taxonomy accurately identifies root causes without bias from the analysts' interpretations or the specific models tested.

What would settle it

An independent re-categorization of the same or a new set of LLM repair failures on GitHub issues that finds localization or another stage produces the highest error rate instead.
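
A re-categorization of that kind would also need to report agreement with the original labels. Below is a minimal sketch of Cohen's kappa between two annotators' stage assignments over the same failure cases; the two label lists are hypothetical placeholders, not data from the paper.

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa for two annotators labeling the same cases."""
        assert len(labels_a) == len(labels_b) and labels_a
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                       for c in set(freq_a) | set(freq_b))
        return (observed - expected) / (1 - expected)

    # Hypothetical stage labels for six failures; real inputs would be the
    # original 243 assignments and an independent re-coding of the same cases.
    original    = ["strategy", "strategy", "understanding", "localization", "strategy", "validation"]
    independent = ["strategy", "understanding", "understanding", "localization", "strategy", "validation"]
    print(f"kappa = {cohens_kappa(original, independent):.2f}")  # ~0.77 for this toy input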

Figures

Figures reproduced from arXiv: 2605.12270 by Guancheng Wang, Hui Liu, Junjie Chen, Lionel Briand, Yanjie Jiang, Yian Huang.

Figure 1. Workflow of Our Empirical Study. Phase 1, Data Preparation: random sampling from the SWE-bench Verified dataset [37] to ensure a representative task subset (Section III-B). Phase 2, Autonomous Execution and Evaluation: the corpus is processed by the agent execution framework (Section III-C), which orchestrates three sta…
Figure 2. The failure diagnosis workflow.
Figure 3. Illustration of the diagnostic workflow in …
Figure 5. Impact of misleading textual hints on patch generation in scikit-learn.
Figure 4. Comparison of body and header cleanup handling in django-16502.
Figure 6. Localization failure in django-16560.
Bidirectional inconsistency in astropy-14182.
Comparison of repair strategies in sphinx-11510.
Figure 11. Specification gap in django-13023 (semantic equivalence vs. rigid type assertion): the model outputs str(max_length), a string, where the test expects max_length, an integer; both render identical HTML attributes, yet the benchmark oracle's strict internal type assertion fails.
Side effect comparison in astropy-8872.
Figure 15. Heatmap of failure mode distributions.
Figure 16. Execution consistency comparison.
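
Figure 11's django-13023 case is the clearest instance of the harness-misjudgment finding: a patch that stores str(max_length) renders the same HTML attribute as one that stores the integer, yet a strict internal type assertion rejects it. Below is a toy reconstruction of that gap; the rendering function and values are illustrative, not Django's or SWE-bench's actual code.

    # Toy reconstruction of the semantic-equivalence vs. rigid-type-assertion
    # gap from Figure 11 (django-13023). Names and values are illustrative only.
    def render_attr(max_length):
        # Rendering coerces to str either way, so both patches emit identical HTML.
        return f'maxlength="{max_length}"'

    model_value     = str(10)   # the model's patch stores the value as a string
    reference_value = 10        # the reference patch stores it as an integer

    functionally_equivalent = render_attr(model_value) == render_attr(reference_value)
    oracle_accepts = isinstance(model_value, int)   # strict internal type check
    print(functionally_equivalent, oracle_accepts)  # True False: a false negative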
original abstract

Large Language Models (LLMs) are increasingly deployed to resolve real-world GitHub issues. However, despite their potential, the specific failure modes of these models in complex repair tasks remain poorly understood. To characterize how LLM behavior diverges from human developer practices, this paper evaluates three state-of-the-art models, i.e., Claude 4.5 Sonnet, Gemini 3 Pro, and GPT-5, on the SWE-bench Verified dataset. We conduct a rigorous manual analysis of the symptoms and root causes underlying 243 failed attempts across 900 total trials. Our investigation first yields a unified failure taxonomy encompassing five distinct stages of the repair pipeline, within which we categorize typical failure symptoms and their prevalence. Secondly, our findings reveal that for all evaluated LLMs, strategy formulation and logic synthesis constitutes the most error-prone stage, followed by problem understanding, whereas localization exhibits the lowest failure rate. This suggests that LLMs may excel at fault localization, a task traditionally regarded as one of the most formidable challenges in automated program repair. Furthermore, we observe that robustness and operational costs (particularly in failure scenarios) vary significantly across different models. Finally, we uncover the root causes of these failures and propose actionable strategies to mitigate them. A particularly notable finding is that existing evaluation harnesses occasionally misjudge correct patches due to superficial discrepancies or hidden constraints. Collectively, our insights may provide promising directions for enhancing the effectiveness and reliability of LLM-based issue resolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates three LLMs (Claude 4.5 Sonnet, Gemini 3 Pro, GPT-5) on the SWE-bench Verified dataset across 900 trials, manually analyzes 243 failures, and derives a five-stage failure taxonomy for the LLM-based repair pipeline. It claims that strategy formulation and logic synthesis is the most error-prone stage for all models, followed by problem understanding, while localization shows the lowest failure rate. Additional findings address model differences in robustness and costs, root causes with mitigation strategies, and occasional misjudgments by existing evaluation harnesses.

Significance. If the categorization holds, the work offers concrete, actionable insights into LLM limitations for real-world GitHub issue resolution, highlighting that LLMs may already handle localization effectively (contrary to traditional APR assumptions) but struggle with higher-level strategy and logic. The manual analysis of a verified benchmark and the identification of harness misjudgments are strengths that could guide future LLM-based APR improvements.

major comments (2)
  1. [Manual analysis and taxonomy section] The section describing the manual analysis and taxonomy construction (referenced in the abstract as the 'rigorous manual analysis' of 243 cases yielding the 'unified failure taxonomy'): no details are provided on the annotation protocol, including number of annotators, whether annotations were performed independently, how disagreements were resolved, or any inter-rater reliability metric such as Cohen's kappa. This is load-bearing because the central empirical claim—the prevalence ordering with strategy/logic synthesis as most error-prone, followed by problem understanding and localization lowest—rests entirely on these post-hoc assignments.
  2. [Results on failure stages] The results section reporting stage prevalences: the taxonomy appears induced from the same 243 failure cases rather than applied from a pre-defined, independently validated scheme. Without an a priori taxonomy or cross-validation, systematic interpretive bias in stage assignment (e.g., coding ambiguous cases as 'strategy' vs. 'problem understanding') could directly alter or reverse the reported ordering across the three models.
minor comments (2)
  1. [Abstract] The abstract states '900 total trials' but does not break down the number of attempts per model or per issue; adding this table or clarification would improve reproducibility.
  2. [Discussion of evaluation harnesses] The claim that 'existing evaluation harnesses occasionally misjudge correct patches' is noted as notable but lacks a specific count or examples of such misjudgments in the provided summary; a dedicated table or subsection would strengthen this observation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the potential impact of our findings on LLM-based automated program repair. We address each major comment below and have revised the manuscript accordingly to improve methodological transparency.

point-by-point responses
  1. Referee: [Manual analysis and taxonomy section] The section describing the manual analysis and taxonomy construction (referenced in the abstract as the 'rigorous manual analysis' of 243 cases yielding the 'unified failure taxonomy'): no details are provided on the annotation protocol, including number of annotators, whether annotations were performed independently, how disagreements were resolved, or any inter-rater reliability metric such as Cohen's kappa. This is load-bearing because the central empirical claim—the prevalence ordering with strategy/logic synthesis as most error-prone, followed by problem understanding and localization lowest—rests entirely on these post-hoc assignments.

    Authors: We agree that the original manuscript lacked sufficient detail on the annotation protocol. The analysis was performed by the first two authors. Both independently coded a random sample of 50 failure cases to iteratively develop the taxonomy through discussion. Disagreements were resolved via consensus meetings. The finalized taxonomy was then applied to the full set of 243 cases by the first author, with the second author independently reviewing a 20% random subset for consistency. We did not compute Cohen's kappa because the taxonomy was developed collaboratively and iteratively rather than through fully independent coding of a fixed scheme. We have added a new subsection to the revised manuscript describing this protocol, including sample sizes, the iterative process, and resolution method. This addition directly supports the reliability of the reported prevalence ordering. revision: yes

  2. Referee: [Results on failure stages] The results section reporting stage prevalences: the taxonomy appears induced from the same 243 failure cases rather than applied from a pre-defined, independently validated scheme. Without an a priori taxonomy or cross-validation, systematic interpretive bias in stage assignment (e.g., coding ambiguous cases as 'strategy' vs. 'problem understanding') could directly alter or reverse the reported ordering across the three models.

    Authors: We acknowledge that the taxonomy was derived inductively from the 243 cases, which is appropriate for characterizing novel failure patterns in LLM repair pipelines. To reduce the risk of bias, the five stages were explicitly mapped to the standard sequential phases of LLM-based issue resolution (problem understanding, localization, strategy formulation and logic synthesis, patch generation, and validation). We have revised the manuscript to: (1) detail the inductive coding process, (2) provide concrete examples of ambiguous cases and their classifications with justifications, and (3) discuss why the observed ordering is consistent across all three models and aligns with the raw failure symptoms. While an a priori taxonomy was not used, these changes enable readers to assess potential interpretive effects and the robustness of the prevalence results. revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational taxonomy from manual failure analysis

full rationale

The paper conducts a manual review of 243 failed LLM repair attempts on SWE-bench Verified, induces a five-stage failure taxonomy directly from those cases, and reports prevalence ordering (strategy formulation most error-prone, localization least). No equations, parameters, predictions, or first-principles derivations exist; the taxonomy is descriptive and applied to the same data by construction, which is standard qualitative practice rather than a self-referential reduction of any quantitative claim. No self-citation load-bearing steps, fitted inputs renamed as predictions, or ansatzes appear. The work is self-contained as an empirical characterization without any derivation chain that collapses to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical characterization study; the abstract mentions no free parameters, mathematical axioms, or newly invented entities.

pith-pipeline@v0.9.0 · 5575 in / 1032 out tokens · 89925 ms · 2026-05-13T03:46:44.860465+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

  1. [1] Repository grouping. Tasks belonging to the same repository are grouped and analyzed together. This reduces context switching and allows reviewers to develop a repository-specific understanding of coding conventions, architectural patterns, and implicit design assumptions.

  2. [2] Difficulty ordering. Each task contains exactly nine repair attempts (3 models × 3 trials). Within each repository, tasks are analyzed in ascending order of total failure count, starting from tasks with fewer failed attempts and progressing to more difficult ones. This ordering helps analysts gradually build contextual familiarity before examining more complex…

  3. [3] Patch comparison. For each task, we systematically compare all failed generated patches against the reference patch. We focus on structural and logic differences, including modified files, insertion locations, content changes, and whether the introduced functionality is behaviorally equivalent to the reference implementation.

  4. [4] Context reconstruction. When patch comparison alone is insufficient to explain the failure, we reconstruct the repository state using the provided base_commit and inspect the exact historical code context. This enables analysis of surrounding control flow, hidden dependencies, and framework-specific constraints that may not be visible from the patch alone. (See the sketch after this list.)

  5. [5] Trajectory attribution. We trace the agent's reasoning process step by step using the complete execution logs generated by mini-SWE-agent. By aligning these sequential trajectories with the reconstructed repository context, we identify the earliest point at which reasoning or execution deviates from a correct repair path.

  6. [6] Harness diagnosis. In cases where the agent successfully passes its self-generated reproduction script but fails the official SWE-bench evaluation, we further inspect the benchmark harness. A patch is considered semantically correct only when multiple reviewers confirm that it satisfies the issue description and aligns with the intent of the reference solution…

  7. [7] Failure labeling. For each failed attempt, we identify the earliest causally dominant breakdown point in the interaction trajectory and assign a single failure label corresponding to that root cause. Although multiple downstream errors may appear during execution, we use single-label attribution to preserve mutual exclusivity and avoid double-counting…

  8. [9] Figure annotation (side effect comparison in astropy-8872): target bug (np.float16, kind 'f'), trigger cast? No; preserved (fixes the local bug).

  9. [10] Figure annotation (astropy-8872): unrelated test (bool array, kind 'b'), trigger cast? Yes; forced cast breaks the global test.

  10. [11] Figure annotation (astropy-8872): unrelated test (bool array, kind 'b'), trigger cast? No; preserved, passes the global test.

  11. [12] Figure annotation (execution timing mismatch in django-13343), initialization: lazy eval via @property creates a proxy object; immediate eval in __init__ backs up the original ref.

  12. [13] Figure annotation (django-13343), reflection: crash when accessing the proxy vs. pass when accessing the instance.

  13. [14] Figure annotation (django-13343, Figure 14: execution timing mismatch), serialization: unreachable after the crash vs. pass using the backup ref. V4: Execution Timing Mismatch. This category captures failures where a syntactically valid patch executes its logic at the wrong point in the program's runtime lifecycle. Models frequently…

  14. [15] P1: Implicit Rules and Knowledge Boundaries. Task features: these tasks are characterized by information vacuums; they depend on domain-specific rules (e.g., cryptographic protocols, complex formatting specifications) absent from both the codebase and the issue description. Insight: textual hints are insufficient; models cannot reason their way through missing…

  15. [16] P2: Textual Distractors and Alignment Bias. Task features: these tasks contain surface-level seductions, highly visible but incorrect solution hints within bug reports or legacy TODO comments. Insight: models exhibit a strong alignment bias, prioritizing explicit human suggestions over implicit system logic. Actionable strategy: implement a log…

  16. [17] L1: Structural Dispersal and Boundary Traps. Task features: these tasks involve scattered impact zones where logic is spread across disconnected modules (e.g., base components and isolated backends). Insight: a narrow problem description acts as a boundary trap, causing the agent to focus on a local visible fix while ignoring identical defects in distant d…

  17. [18] S1: Functional Symmetry and State Propagation. Task features: defined by implicit contracts, these tasks involve paired operations where a change on one side demands a reciprocal update; failures often manifest as isolated crashes far from the root cause. Insight: agents tend to apply superficial patches at the crash site rather than maintaining the integrity…

  18. [19] S2: Foundational Coupling and Ripple Effects. Task features: these tasks touch the architectural bedrock, low-level parsers or query builders with massive downstream dependencies. Insight: the risk of regression is extreme because core changes affect hundreds of modules. Actionable strategy: for core-component modifications, agents should prioritize…

  19. [20] SWE-bench: Can Language Models Resolve Real-World GitHub Issues? V1 & V2: The Oracle-Specification Gap. Task features: a profound mismatch between flexible user expectations and inflexible test oracles; tests often demand absolute equality of internal types or string formats that are never specified in the task. Insight: these are artificial failures; the repair is functionally correct but violates a hidden, arbitrary design…
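
The patch-comparison and context-reconstruction steps in items 3 and 4 amount to checking each failed patch against the historical repository state. A minimal illustration with plain git commands follows; the repository path, base_commit value, and patch filenames are hypothetical, and this is not the authors' tooling.

    import subprocess

    def reconstruct_and_compare(repo_dir, base_commit, failed_patch, reference_patch):
        """Check out a task's base_commit, then compare a failed patch against
        the reference patch at that exact historical code state."""
        # Restore the repository to the state the agent saw.
        subprocess.run(["git", "-C", repo_dir, "checkout", "--force", base_commit], check=True)
        # Does the failed patch even apply cleanly at that state?
        applies = subprocess.run(
            ["git", "-C", repo_dir, "apply", "--check", failed_patch]).returncode == 0
        # Side-by-side textual diff of the two patches for manual comparison.
        subprocess.run(["diff", "-u", failed_patch, reference_patch])
        return applies

    # Hypothetical usage for one SWE-bench Verified task:
    # reconstruct_and_compare("repos/django", "<base_commit sha>",
    #                         "failed/django-16502_trial1.diff",
    #                         "reference/django-16502.diff")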