pith. machine review for the scientific record.

arxiv: 2605.12270 · v1 · submitted 2026-05-12 · 💻 cs.SE

Recognition: no theorem link

Characterizing the Failure Modes of LLMs in Resolving Real-World GitHub Issues

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 03:46 UTC · model grok-4.3

classification 💻 cs.SE
keywords: LLM failure modes · GitHub issue resolution · automated program repair · repair pipeline stages · strategy formulation · fault localization · evaluation harness

The pith

LLMs resolving real GitHub issues fail most often at strategy formulation and logic synthesis, while performing best at fault localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates three leading large language models on resolving real-world GitHub issues and manually examines 243 failed repair attempts to build a five-stage taxonomy of the repair process. It measures where each stage breaks down and finds that strategy formulation and logic synthesis produces the most errors for every model, with problem understanding as the second most common failure point. Localization shows the lowest failure rate, suggesting LLMs handle this traditionally difficult task more reliably than expected. The study also documents differences in model robustness and operational costs during failures, and notes that some existing evaluation setups misclassify correct patches due to superficial issues. From these patterns the authors derive root causes and practical mitigation steps for improving LLM-based repairs.

Core claim

Across the evaluated models, strategy formulation and logic synthesis constitutes the most error-prone stage in the repair pipeline, followed by problem understanding, whereas localization exhibits the lowest failure rate. This indicates that LLMs may excel at fault localization, a task traditionally regarded as one of the most formidable challenges in automated program repair. The analysis further reveals that evaluation harnesses occasionally misjudge correct patches due to superficial discrepancies or hidden constraints.

What carries the argument

A unified five-stage taxonomy of the repair pipeline that categorizes failure symptoms and root causes across problem understanding, localization, strategy formulation and logic synthesis, and other stages.
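
To make the taxonomy's role concrete, here is a minimal sketch of how stage labels over failed attempts yield the prevalence ordering the paper reports. The stage names follow the paper's five phases; the example records and the tallying code are hypothetical illustrations, not the authors' analysis scripts.

    from collections import Counter
    from enum import Enum

    class Stage(Enum):
        """The paper's five repair-pipeline stages."""
        PROBLEM_UNDERSTANDING = "problem understanding"
        LOCALIZATION = "localization"
        STRATEGY_AND_LOGIC = "strategy formulation and logic synthesis"
        PATCH_GENERATION = "patch generation"
        VALIDATION = "validation"

    # Hypothetical failure records: (task id, model, earliest failing stage).
    # The real labels come from the authors' manual analysis of 243 failed attempts.
    failures = [
        ("task-001", "model-a", Stage.STRATEGY_AND_LOGIC),
        ("task-002", "model-b", Stage.PROBLEM_UNDERSTANDING),
        ("task-003", "model-c", Stage.LOCALIZATION),
    ]

    def stage_prevalence(records):
        """Share of failures attributed to each stage (single-label attribution)."""
        counts = Counter(stage for _, _, stage in records)
        total = sum(counts.values())
        return {stage: counts[stage] / total for stage in Stage}

    for stage, share in sorted(stage_prevalence(failures).items(),
                               key=lambda kv: kv[1], reverse=True):
        print(f"{stage.value}: {share:.1%}")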

If this is right

  • LLMs show relative strength at localizing faults within codebases compared with other repair stages.
  • Future LLM improvements for issue resolution should prioritize better strategy planning and logic synthesis capabilities.
  • Model selection matters for balancing success rates against robustness and cost when failures occur.
  • Existing evaluation harnesses require refinement to reduce false negatives on patches that are functionally correct.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems that combine LLMs with dedicated localization tools could leverage the observed strength in that stage.
  • The stage-wise failure pattern may appear in other code-reasoning tasks beyond GitHub issue resolution.
  • Explicit strategy-planning prompts or intermediate reasoning checkpoints could reduce the dominant error type.
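
One way to act on the last suggestion is an agent loop that forces an explicit, reviewable plan before any edit is attempted. This is an editorial sketch under stated assumptions: the prompt wording, the plan_then_patch helper, and the call_llm interface are hypothetical, not something the paper implements.

    # Hypothetical scaffold: a strategy-formulation checkpoint between
    # localization and patch generation. call_llm stands in for any chat API.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError("wire up a model client here")

    PLAN_PROMPT = """You are resolving the GitHub issue below.
    Before writing any code, output a numbered repair plan that
    1. restates the root cause in one sentence,
    2. lists every file and function the fix must touch,
    3. states the invariant the patch must preserve.
    Issue:
    {issue}
    Suspect locations (from a separate localization pass):
    {locations}"""

    def plan_then_patch(issue: str, locations: list[str]) -> str:
        plan = call_llm(PLAN_PROMPT.format(issue=issue, locations="\n".join(locations)))
        # Checkpoint: a human or a second model can reject the plan here,
        # catching strategy errors before a patch is ever generated.
        return call_llm(f"Implement exactly this plan as a unified diff:\n{plan}")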

Load-bearing premise

The manual categorization of the 243 failures into the five-stage taxonomy accurately identifies root causes without bias from the analysts' interpretations or the specific models tested.

What would settle it

An independent re-categorization of the same or a new set of LLM repair failures on GitHub issues that finds localization or another stage produces the highest error rate instead.
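
A re-categorization of that kind would also need to report agreement with the original labels. Below is a minimal sketch of Cohen's kappa between two annotators' stage assignments over the same failure cases; the two label lists are hypothetical placeholders, not data from the paper.

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa for two annotators labeling the same cases."""
        assert len(labels_a) == len(labels_b) and labels_a
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                       for c in set(freq_a) | set(freq_b))
        return (observed - expected) / (1 - expected)

    # Hypothetical stage labels for six failures; real inputs would be the
    # original 243 assignments and an independent re-coding of the same cases.
    original    = ["strategy", "strategy", "understanding", "localization", "strategy", "validation"]
    independent = ["strategy", "understanding", "understanding", "localization", "strategy", "validation"]
    print(f"kappa = {cohens_kappa(original, independent):.2f}")  # ~0.77 for this toy input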

Figures

Figures reproduced from arXiv: 2605.12270 by Guancheng Wang, Hui Liu, Junjie Chen, Lionel Briand, Yanjie Jiang, Yian Huang.

Figure 1. Workflow of Our Empirical Study. Phase 1, Data Preparation: random sampling from the SWE-bench Verified dataset [37] to ensure a representative task subset (Section III-B). Phase 2, Autonomous Execution and Evaluation: the corpus is processed by the agent execution framework (Section III-C), which orchestrates three sta…
Figure 2. The failure diagnosis workflow.
Figure 3. Illustration of the diagnostic workflow in …
Figure 5. Impact of misleading textual hints on patch generation in scikit-learn.
Figure 4. Comparison of body and header cleanup handling in django-16502.
Figure 6. Localization failure in django-16560.
Bidirectional inconsistency in astropy-14182.
Comparison of repair strategies in sphinx-11510.
Figure 11. Specification gap in django-13023 (semantic equivalence vs. rigid type assertion): the model outputs str(max_length), a string, where the test expects max_length, an integer; both render identical HTML attributes, yet the benchmark oracle's strict internal type assertion fails.
Side effect comparison in astropy-8872.
Figure 15. Heatmap of failure mode distributions.
Figure 16. Execution consistency comparison.
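
Figure 11's django-13023 case is the clearest instance of the harness-misjudgment finding: a patch that stores str(max_length) renders the same HTML attribute as one that stores the integer, yet a strict internal type assertion rejects it. Below is a toy reconstruction of that gap; the rendering function and values are illustrative, not Django's or SWE-bench's actual code.

    # Toy reconstruction of the semantic-equivalence vs. rigid-type-assertion
    # gap from Figure 11 (django-13023). Names and values are illustrative only.
    def render_attr(max_length):
        # Rendering coerces to str either way, so both patches emit identical HTML.
        return f'maxlength="{max_length}"'

    model_value     = str(10)   # the model's patch stores the value as a string
    reference_value = 10        # the reference patch stores it as an integer

    functionally_equivalent = render_attr(model_value) == render_attr(reference_value)
    oracle_accepts = isinstance(model_value, int)   # strict internal type check
    print(functionally_equivalent, oracle_accepts)  # True False: a false negative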
original abstract

Large Language Models (LLMs) are increasingly deployed to resolve real-world GitHub issues. However, despite their potential, the specific failure modes of these models in complex repair tasks remain poorly understood. To characterize how LLM behavior diverges from human developer practices, this paper evaluates three state-of-the-art models, i.e., Claude 4.5 Sonnet, Gemini 3 Pro, and GPT-5, on the SWE-bench Verified dataset. We conduct a rigorous manual analysis of the symptoms and root causes underlying 243 failed attempts across 900 total trials. Our investigation first yields a unified failure taxonomy encompassing five distinct stages of the repair pipeline, within which we categorize typical failure symptoms and their prevalence. Secondly, our findings reveal that for all evaluated LLMs, strategy formulation and logic synthesis constitutes the most error-prone stage, followed by problem understanding, whereas localization exhibits the lowest failure rate. This suggests that LLMs may excel at fault localization, a task traditionally regarded as one of the most formidable challenges in automated program repair. Furthermore, we observe that robustness and operational costs (particularly in failure scenarios) vary significantly across different models. Finally, we uncover the root causes of these failures and propose actionable strategies to mitigate them. A particularly notable finding is that existing evaluation harnesses occasionally misjudge correct patches due to superficial discrepancies or hidden constraints. Collectively, our insights may provide promising directions for enhancing the effectiveness and reliability of LLM-based issue resolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates three LLMs (Claude 4.5 Sonnet, Gemini 3 Pro, GPT-5) on the SWE-bench Verified dataset across 900 trials, manually analyzes 243 failures, and derives a five-stage failure taxonomy for the LLM-based repair pipeline. It claims that strategy formulation and logic synthesis is the most error-prone stage for all models, followed by problem understanding, while localization shows the lowest failure rate. Additional findings address model differences in robustness and costs, root causes with mitigation strategies, and occasional misjudgments by existing evaluation harnesses.

Significance. If the categorization holds, the work offers concrete, actionable insights into LLM limitations for real-world GitHub issue resolution, highlighting that LLMs may already handle localization effectively (contrary to traditional APR assumptions) but struggle with higher-level strategy and logic. The manual analysis of a verified benchmark and the identification of harness misjudgments are strengths that could guide future LLM-based APR improvements.

major comments (2)
  1. [Manual analysis and taxonomy section] The section describing the manual analysis and taxonomy construction (referenced in the abstract as the 'rigorous manual analysis' of 243 cases yielding the 'unified failure taxonomy'): no details are provided on the annotation protocol, including number of annotators, whether annotations were performed independently, how disagreements were resolved, or any inter-rater reliability metric such as Cohen's kappa. This is load-bearing because the central empirical claim—the prevalence ordering with strategy/logic synthesis as most error-prone, followed by problem understanding and localization lowest—rests entirely on these post-hoc assignments.
  2. [Results on failure stages] The results section reporting stage prevalences: the taxonomy appears induced from the same 243 failure cases rather than applied from a pre-defined, independently validated scheme. Without an a priori taxonomy or cross-validation, systematic interpretive bias in stage assignment (e.g., coding ambiguous cases as 'strategy' vs. 'problem understanding') could directly alter or reverse the reported ordering across the three models.
minor comments (2)
  1. [Abstract] The abstract states '900 total trials' but does not break down the number of attempts per model or per issue; adding this table or clarification would improve reproducibility.
  2. [Discussion of evaluation harnesses] The claim that 'existing evaluation harnesses occasionally misjudge correct patches' is noted as notable but lacks a specific count or examples of such misjudgments in the provided summary; a dedicated table or subsection would strengthen this observation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the potential impact of our findings on LLM-based automated program repair. We address each major comment below and have revised the manuscript accordingly to improve methodological transparency.

point-by-point responses
  1. Referee: [Manual analysis and taxonomy section] The section describing the manual analysis and taxonomy construction (referenced in the abstract as the 'rigorous manual analysis' of 243 cases yielding the 'unified failure taxonomy'): no details are provided on the annotation protocol, including number of annotators, whether annotations were performed independently, how disagreements were resolved, or any inter-rater reliability metric such as Cohen's kappa. This is load-bearing because the central empirical claim—the prevalence ordering with strategy/logic synthesis as most error-prone, followed by problem understanding and localization lowest—rests entirely on these post-hoc assignments.

    Authors: We agree that the original manuscript lacked sufficient detail on the annotation protocol. The analysis was performed by the first two authors. Both independently coded a random sample of 50 failure cases to iteratively develop the taxonomy through discussion. Disagreements were resolved via consensus meetings. The finalized taxonomy was then applied to the full set of 243 cases by the first author, with the second author independently reviewing a 20% random subset for consistency. We did not compute Cohen's kappa because the taxonomy was developed collaboratively and iteratively rather than through fully independent coding of a fixed scheme. We have added a new subsection to the revised manuscript describing this protocol, including sample sizes, the iterative process, and resolution method. This addition directly supports the reliability of the reported prevalence ordering. revision: yes

  2. Referee: [Results on failure stages] The results section reporting stage prevalences: the taxonomy appears induced from the same 243 failure cases rather than applied from a pre-defined, independently validated scheme. Without an a priori taxonomy or cross-validation, systematic interpretive bias in stage assignment (e.g., coding ambiguous cases as 'strategy' vs. 'problem understanding') could directly alter or reverse the reported ordering across the three models.

    Authors: We acknowledge that the taxonomy was derived inductively from the 243 cases, which is appropriate for characterizing novel failure patterns in LLM repair pipelines. To reduce the risk of bias, the five stages were explicitly mapped to the standard sequential phases of LLM-based issue resolution (problem understanding, localization, strategy formulation and logic synthesis, patch generation, and validation). We have revised the manuscript to: (1) detail the inductive coding process, (2) provide concrete examples of ambiguous cases and their classifications with justifications, and (3) discuss why the observed ordering is consistent across all three models and aligns with the raw failure symptoms. While an a priori taxonomy was not used, these changes enable readers to assess potential interpretive effects and the robustness of the prevalence results. revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational taxonomy from manual failure analysis

full rationale

The paper conducts a manual review of 243 failed LLM repair attempts on SWE-bench Verified, induces a five-stage failure taxonomy directly from those cases, and reports prevalence ordering (strategy formulation most error-prone, localization least). No equations, parameters, predictions, or first-principles derivations exist; the taxonomy is descriptive and applied to the same data by construction, which is standard qualitative practice rather than a self-referential reduction of any quantitative claim. No self-citation load-bearing steps, fitted inputs renamed as predictions, or ansatzes appear. The work is self-contained as an empirical characterization without any derivation chain that collapses to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical characterization study; the abstract mentions no free parameters, mathematical axioms, or newly invented entities.

pith-pipeline@v0.9.0 · 5575 in / 1032 out tokens · 89925 ms · 2026-05-13T03:46:44.860465+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

  1. [1] Repository grouping. Tasks belonging to the same repository are grouped and analyzed together. This reduces context switching and allows reviewers to develop a repository-specific understanding of coding conventions, architectural patterns, and implicit design assumptions.

  2. [2] Difficulty ordering. Each task contains exactly nine repair attempts (3 models × 3 trials). Within each repository, tasks are analyzed in ascending order of total failure count, starting from tasks with fewer failed attempts and progressing to more difficult ones. This ordering helps analysts gradually build contextual familiarity before examining more complex…

  3. [3] Patch comparison. For each task, we systematically compare all failed generated patches against the reference patch. We focus on structural and logic differences, including modified files, insertion locations, content changes, and whether the introduced functionality is behaviorally equivalent to the reference implementation.

  4. [4] Context reconstruction. When patch comparison alone is insufficient to explain the failure, we reconstruct the repository state using the provided base_commit and inspect the exact historical code context. This enables analysis of surrounding control flow, hidden dependencies, and framework-specific constraints that may not be visible from the patch alone. (See the sketch after this list.)

  5. [5] Trajectory attribution. We trace the agent's reasoning process step by step using the complete execution logs generated by mini-SWE-agent. By aligning these sequential trajectories with the reconstructed repository context, we identify the earliest point at which reasoning or execution deviates from a correct repair path.

  6. [6] Harness diagnosis. In cases where the agent successfully passes its self-generated reproduction script but fails the official SWE-bench evaluation, we further inspect the benchmark harness. A patch is considered semantically correct only when multiple reviewers confirm that it satisfies the issue description and aligns with the intent of the reference solution…

  7. [7] Failure labeling. For each failed attempt, we identify the earliest causally dominant breakdown point in the interaction trajectory and assign a single failure label corresponding to that root cause. Although multiple downstream errors may appear during execution, we use single-label attribution to preserve mutual exclusivity and avoid double-counting…

  8. [9] Figure annotation (side effect comparison in astropy-8872): target bug (np.float16, kind 'f'), trigger cast? No; preserved (fixes the local bug).

  9. [10] Figure annotation (astropy-8872): unrelated test (bool array, kind 'b'), trigger cast? Yes; forced cast breaks the global test.

  10. [11] Figure annotation (astropy-8872): unrelated test (bool array, kind 'b'), trigger cast? No; preserved, passes the global test.

  11. [12] Figure annotation (execution timing mismatch in django-13343), initialization: lazy eval via @property creates a proxy object; immediate eval in __init__ backs up the original ref.

  12. [13] Figure annotation (django-13343), reflection: crash when accessing the proxy vs. pass when accessing the instance.

  13. [14] Figure annotation (django-13343, Figure 14: execution timing mismatch), serialization: unreachable after the crash vs. pass using the backup ref. V4: Execution Timing Mismatch. This category captures failures where a syntactically valid patch executes its logic at the wrong point in the program's runtime lifecycle. Models frequently…

  14. [15] P1: Implicit Rules and Knowledge Boundaries. Task features: these tasks are characterized by information vacuums; they depend on domain-specific rules (e.g., cryptographic protocols, complex formatting specifications) absent from both the codebase and the issue description. Insight: textual hints are insufficient; models cannot reason their way through missing…

  15. [16] P2: Textual Distractors and Alignment Bias. Task features: these tasks contain surface-level seductions, highly visible but incorrect solution hints within bug reports or legacy TODO comments. Insight: models exhibit a strong alignment bias, prioritizing explicit human suggestions over implicit system logic. Actionable strategy: implement a log…

  16. [17] L1: Structural Dispersal and Boundary Traps. Task features: these tasks involve scattered impact zones where logic is spread across disconnected modules (e.g., base components and isolated backends). Insight: a narrow problem description acts as a boundary trap, causing the agent to focus on a local visible fix while ignoring identical defects in distant d…

  17. [18] S1: Functional Symmetry and State Propagation. Task features: defined by implicit contracts, these tasks involve paired operations where a change on one side demands a reciprocal update; failures often manifest as isolated crashes far from the root cause. Insight: agents tend to apply superficial patches at the crash site rather than maintaining the integrity…

  18. [19] S2: Foundational Coupling and Ripple Effects. Task features: these tasks touch the architectural bedrock, low-level parsers or query builders with massive downstream dependencies. Insight: the risk of regression is extreme because core changes affect hundreds of modules. Actionable strategy: for core-component modifications, agents should prioritize…

  19. [20] SWE-bench: Can Language Models Resolve Real-World GitHub Issues? V1 & V2: The Oracle-Specification Gap. Task features: a profound mismatch between flexible user expectations and inflexible test oracles; tests often demand absolute equality of internal types or string formats that are never specified in the task. Insight: these are artificial failures; the repair is functionally correct but violates a hidden, arbitrary design…
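
The patch-comparison and context-reconstruction steps in items 3 and 4 amount to checking each failed patch against the historical repository state. A minimal illustration with plain git commands follows; the repository path, base_commit value, and patch filenames are hypothetical, and this is not the authors' tooling.

    import subprocess

    def reconstruct_and_compare(repo_dir, base_commit, failed_patch, reference_patch):
        """Check out a task's base_commit, then compare a failed patch against
        the reference patch at that exact historical code state."""
        # Restore the repository to the state the agent saw.
        subprocess.run(["git", "-C", repo_dir, "checkout", "--force", base_commit], check=True)
        # Does the failed patch even apply cleanly at that state?
        applies = subprocess.run(
            ["git", "-C", repo_dir, "apply", "--check", failed_patch]).returncode == 0
        # Side-by-side textual diff of the two patches for manual comparison.
        subprocess.run(["diff", "-u", failed_patch, reference_patch])
        return applies

    # Hypothetical usage for one SWE-bench Verified task:
    # reconstruct_and_compare("repos/django", "<base_commit sha>",
    #                         "failed/django-16502_trial1.diff",
    #                         "reference/django-16502.diff")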