Characterizing the Failure Modes of LLMs in Resolving Real-World GitHub Issues
Pith reviewed 2026-05-13 03:46 UTC · model grok-4.3
The pith
LLMs resolving real GitHub issues fail most often at strategy formulation and logic synthesis, while performing best at fault localization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across the evaluated models, strategy formulation and logic synthesis constitutes the most error-prone stage in the repair pipeline, followed by problem understanding, whereas localization exhibits the lowest failure rate. This indicates that LLMs may excel at fault localization, a task traditionally regarded as one of the most formidable challenges in automated program repair. The analysis further reveals that evaluation harnesses occasionally misjudge correct patches due to superficial discrepancies or hidden constraints.
What carries the argument
A unified five-stage taxonomy of the repair pipeline that categorizes failure symptoms and root causes across problem understanding, localization, strategy formulation and logic synthesis, and other stages.
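The prevalence ordering that carries the core claim is, mechanically, a counting exercise over per-failure stage labels. A minimal sketch, with purely illustrative counts (the paper's actual distribution over its 243 failures is not reproduced here):

```python
from collections import Counter

# Hypothetical stage labels for a set of failed repair attempts.
# The counts below are invented for illustration only.
stages = (
    ["strategy formulation and logic synthesis"] * 5
    + ["problem understanding"] * 3
    + ["patch generation"] * 2
    + ["validation"] * 2
    + ["localization"] * 1
)

# Prevalence ordering: most error-prone stage first
ordering = [stage for stage, _ in Counter(stages).most_common()]
```

With these invented counts, `ordering` puts strategy formulation and logic synthesis first and localization last, mirroring the ordering the paper reports.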
If this is right
- LLMs show relative strength at localizing faults within codebases compared with other repair stages.
- Future LLM improvements for issue resolution should prioritize better strategy planning and logic synthesis capabilities.
- Model selection matters for balancing success rates against robustness and cost when failures occur.
- Existing evaluation harnesses require refinement to reduce false negatives on patches that are functionally correct.
Where Pith is reading between the lines
- Systems that combine LLMs with dedicated localization tools could leverage the observed strength in that stage.
- The stage-wise failure pattern may appear in other code-reasoning tasks beyond GitHub issue resolution.
- Explicit strategy-planning prompts or intermediate reasoning checkpoints could reduce the dominant error type.
Load-bearing premise
The manual categorization of the 243 failures into the five-stage taxonomy accurately identifies root causes without bias from the analysts' interpretations or the specific models tested.
What would settle it
An independent re-categorization of the same or a new set of LLM repair failures on GitHub issues that finds localization or another stage produces the highest error rate instead.
Original abstract
Large Language Models (LLMs) are increasingly deployed to resolve real-world GitHub issues. However, despite their potential, the specific failure modes of these models in complex repair tasks remain poorly understood. To characterize how LLM behavior diverges from human developer practices, this paper evaluates three state-of-the-art models, i.e., Claude 4.5 Sonnet, Gemini 3 Pro, and GPT-5, on the SWE-bench Verified dataset. We conduct a rigorous manual analysis of the symptoms and root causes underlying 243 failed attempts across 900 total trials. Our investigation first yields a unified failure taxonomy encompassing five distinct stages of the repair pipeline, within which we categorize typical failure symptoms and their prevalence. Secondly, our findings reveal that for all evaluated LLMs, strategy formulation and logic synthesis constitutes the most error-prone stage, followed by problem understanding, whereas localization exhibits the lowest failure rate. This suggests that LLMs may excel at fault localization, a task traditionally regarded as one of the most formidable challenges in automated program repair. Furthermore, we observe that robustness and operational costs (particularly in failure scenarios) vary significantly across different models. Finally, we uncover the root causes of these failures and propose actionable strategies to mitigate them. A particularly notable finding is that existing evaluation harnesses occasionally misjudge correct patches due to superficial discrepancies or hidden constraints. Collectively, our insights may provide promising directions for enhancing the effectiveness and reliability of LLM-based issue resolution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates three LLMs (Claude 4.5 Sonnet, Gemini 3 Pro, GPT-5) on the SWE-bench Verified dataset across 900 trials, manually analyzes 243 failures, and derives a five-stage failure taxonomy for the LLM-based repair pipeline. It claims that strategy formulation and logic synthesis is the most error-prone stage for all models, followed by problem understanding, while localization shows the lowest failure rate. Additional findings address model differences in robustness and costs, root causes with mitigation strategies, and occasional misjudgments by existing evaluation harnesses.
Significance. If the categorization holds, the work offers concrete, actionable insights into LLM limitations for real-world GitHub issue resolution, highlighting that LLMs may already handle localization effectively (contrary to traditional APR assumptions) but struggle with higher-level strategy and logic. The manual analysis of a verified benchmark and the identification of harness misjudgments are strengths that could guide future LLM-based APR improvements.
major comments (2)
- [Manual analysis and taxonomy section] The section describing the manual analysis and taxonomy construction (referenced in the abstract as the 'rigorous manual analysis' of 243 cases yielding the 'unified failure taxonomy'): no details are provided on the annotation protocol, including number of annotators, whether annotations were performed independently, how disagreements were resolved, or any inter-rater reliability metric such as Cohen's kappa. This is load-bearing because the central empirical claim—the prevalence ordering with strategy/logic synthesis as most error-prone, followed by problem understanding and localization lowest—rests entirely on these post-hoc assignments.
- [Results on failure stages] The results section reporting stage prevalences: the taxonomy appears induced from the same 243 failure cases rather than applied from a pre-defined, independently validated scheme. Without an a priori taxonomy or cross-validation, systematic interpretive bias in stage assignment (e.g., coding ambiguous cases as 'strategy' vs. 'problem understanding') could directly alter or reverse the reported ordering across the three models.
minor comments (2)
- [Abstract] The abstract states '900 total trials' but does not break down the number of attempts per model or per issue; adding this table or clarification would improve reproducibility.
- [Discussion of evaluation harnesses] The claim that 'existing evaluation harnesses occasionally misjudge correct patches' is noted as notable but lacks a specific count or examples of such misjudgments in the provided summary; a dedicated table or subsection would strengthen this observation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for highlighting the potential impact of our findings on LLM-based automated program repair. We address each major comment below and have revised the manuscript accordingly to improve methodological transparency.
Point-by-point responses
- Referee: [Manual analysis and taxonomy section] The section describing the manual analysis and taxonomy construction (referenced in the abstract as the 'rigorous manual analysis' of 243 cases yielding the 'unified failure taxonomy'): no details are provided on the annotation protocol, including number of annotators, whether annotations were performed independently, how disagreements were resolved, or any inter-rater reliability metric such as Cohen's kappa. This is load-bearing because the central empirical claim—the prevalence ordering with strategy/logic synthesis as most error-prone, followed by problem understanding and localization lowest—rests entirely on these post-hoc assignments.
  Authors: We agree that the original manuscript lacked sufficient detail on the annotation protocol. The analysis was performed by the first two authors. Both independently coded a random sample of 50 failure cases to iteratively develop the taxonomy through discussion. Disagreements were resolved via consensus meetings. The finalized taxonomy was then applied to the full set of 243 cases by the first author, with the second author independently reviewing a 20% random subset for consistency. We did not compute Cohen's kappa because the taxonomy was developed collaboratively and iteratively rather than through fully independent coding of a fixed scheme. We have added a new subsection to the revised manuscript describing this protocol, including sample sizes, the iterative process, and resolution method. This addition directly supports the reliability of the reported prevalence ordering. revision: yes
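Should the authors later add the inter-rater check the referee requests, Cohen's kappa reduces to observed agreement corrected for chance agreement over the stage labels. A self-contained sketch on hypothetical annotator data (not the paper's actual annotations):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical stage labels from two annotators on four failure cases
rater_1 = ["strategy", "strategy", "localization", "understanding"]
rater_2 = ["strategy", "understanding", "localization", "understanding"]
kappa = cohens_kappa(rater_1, rater_2)  # ~0.636 on this toy data
```

Values above roughly 0.6 are conventionally read as substantial agreement, which is why reviewers commonly ask for this statistic alongside a consensus protocol.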
- Referee: [Results on failure stages] The results section reporting stage prevalences: the taxonomy appears induced from the same 243 failure cases rather than applied from a pre-defined, independently validated scheme. Without an a priori taxonomy or cross-validation, systematic interpretive bias in stage assignment (e.g., coding ambiguous cases as 'strategy' vs. 'problem understanding') could directly alter or reverse the reported ordering across the three models.
  Authors: We acknowledge that the taxonomy was derived inductively from the 243 cases, which is appropriate for characterizing novel failure patterns in LLM repair pipelines. To reduce the risk of bias, the five stages were explicitly mapped to the standard sequential phases of LLM-based issue resolution (problem understanding, localization, strategy formulation and logic synthesis, patch generation, and validation). We have revised the manuscript to: (1) detail the inductive coding process, (2) provide concrete examples of ambiguous cases and their classifications with justifications, and (3) discuss why the observed ordering is consistent across all three models and aligns with the raw failure symptoms. While an a priori taxonomy was not used, these changes enable readers to assess potential interpretive effects and the robustness of the prevalence results. revision: yes
Circularity Check
No circularity: purely observational taxonomy from manual failure analysis
Full rationale
The paper conducts a manual review of 243 failed LLM repair attempts on SWE-bench Verified, induces a five-stage failure taxonomy directly from those cases, and reports prevalence ordering (strategy formulation most error-prone, localization least). No equations, parameters, predictions, or first-principles derivations exist; the taxonomy is descriptive and applied to the same data by construction, which is standard qualitative practice rather than a self-referential reduction of any quantitative claim. No self-citation load-bearing steps, fitted inputs renamed as predictions, or ansatzes appear. The work is self-contained as an empirical characterization without any derivation chain that collapses to its inputs.
Reference graph
Works this paper leans on
- [1] Repository grouping. Tasks belonging to the same repository are grouped and analyzed together. This reduces context switching and allows reviewers to develop a repository-specific understanding of coding conventions, architectural patterns, and implicit design assumptions.
- [2] Difficulty ordering. Each task contains exactly nine repair attempts (3 models × 3 trials). Within each repository, tasks are analyzed in ascending order of total failure count, starting from tasks with fewer failed attempts and progressing to more difficult ones. This ordering helps analysts gradually build contextual familiarity before examining more complex...
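The review ordering described in this excerpt amounts to grouping tasks by repository and sorting each group by failure count. A hypothetical sketch (task ids other than astropy-8872 and django-13343 are invented for illustration):

```python
# Hypothetical task records: (repository, task_id, failed attempts out of 9)
tasks = [
    ("astropy", "astropy-8872", 9),   # all nine attempts failed
    ("django", "django-13343", 6),
    ("django", "django-11099", 2),
    ("astropy", "astropy-7746", 4),
]

# Group tasks by repository
by_repo = {}
for repo, task_id, failed_attempts in tasks:
    by_repo.setdefault(repo, []).append((failed_attempts, task_id))

# Within each repository, easier tasks (fewer failures) come first,
# so analysts build contextual familiarity before harder cases
review_order = {repo: [tid for _, tid in sorted(pairs)]
                for repo, pairs in by_repo.items()}
```

The design choice is the sort key: ordering by failure count rather than task id front-loads the cases most likely to have simple, quickly diagnosed causes.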
- [3] Patch comparison. For each task, we systematically compare all failed generated patches against the reference patch. We focus on structural and logic differences, including modified files, insertion locations, content changes, and whether the introduced functionality is behaviorally equivalent to the reference implementation.
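A line-level diff is the natural primitive for this kind of structural comparison. A hypothetical sketch using the standard library's `difflib`, with invented patch contents (the function and its repository are not from the paper):

```python
import difflib

# Invented patch bodies for illustration: the generated patch misses a
# guard clause that the reference patch contains.
generated = [
    "def cast(arr):",
    "    return arr.astype('f2')",
]
reference = [
    "def cast(arr):",
    "    if arr.dtype.kind == 'b':",
    "        return arr",
    "    return arr.astype('f2')",
]

diff = list(difflib.unified_diff(generated, reference,
                                 fromfile="generated", tofile="reference",
                                 lineterm=""))
# Lines present only in the reference patch: candidate insertion locations
# and content changes to examine manually
only_in_reference = [line for line in diff
                     if line.startswith("+") and not line.startswith("+++")]
```

Diffing surfaces *where* patches differ; judging behavioral equivalence, as the excerpt notes, still requires manual reasoning about the surrounding code.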
- [4] Context reconstruction. When patch comparison alone is insufficient to explain the failure, we reconstruct the repository state using the provided `base_commit` and inspect the exact historical code context. This enables analysis of surrounding control flow, hidden dependencies, and framework-specific constraints that may not be visible from the patch alone.
- [5] Trajectory attribution. We trace the agent's reasoning process step by step using the complete execution logs generated by mini-SWE-agent. By aligning these sequential trajectories with the reconstructed repository context, we identify the earliest point at which reasoning or execution deviates from a correct repair path.
- [6] Harness diagnosis. In cases where the agent successfully passes its self-generated reproduction script but fails the official SWE-bench evaluation, we further inspect the benchmark harness. A patch is considered semantically correct only when multiple reviewers confirm that it satisfies the issue description and aligns with the intent of the reference soluti...
- [7] Failure labeling. For each failed attempt, we identify the earliest causally dominant breakdown point in the interaction trajectory and assign a single failure label corresponding to that root cause. Although multiple downstream errors may appear during execution, we use single-label attribution to preserve mutual exclusivity and avoid double-counting. T...
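The single-label rule described here is a first-match scan over the trajectory: downstream errors are ignored once the earliest breakdown is found. A hypothetical sketch (step names and the deviation flags are invented for illustration):

```python
# Single-label attribution: return the earliest deviating step in an agent
# trajectory, ignoring any downstream errors so each failure is counted once.
def attribute_failure(trajectory):
    """trajectory: list of (step_name, deviates) pairs in execution order.
    Returns the earliest deviating step's name, or None if none deviates."""
    for step, deviates in trajectory:
        if deviates:
            return step  # earliest causally dominant breakdown wins
    return None

# Invented trace: the strategy step goes wrong first, and the later
# patch-generation error is treated as a consequence, not a second label.
trace = [
    ("problem understanding", False),
    ("localization", False),
    ("strategy formulation", True),
    ("patch generation", True),
]
```

First-match attribution keeps the stage counts mutually exclusive, which is what makes the prevalence percentages in the paper sum cleanly over the 243 failures.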
- [9]–[11] [Figure residue] Figure 1: Side effect comparison in astropy-8872. The panels contrast the target bug (np.float16, kind 'f') with an unrelated test (bool array, kind 'b'): preserving the cast fixes the local bug and passes the global test, while a forced cast breaks the global test.
- [12] [Figure residue] Fig. 14: Execution timing mismatch in django-13343. The panels contrast immediate evaluation in `__init__`, which backs up the original ref, with lazy evaluation via `@property`, which creates a proxy object; serialization is unreachable after the crash, and the passing variant uses the backup ref.
- [14] V4: Execution Timing Mismatch. This category captures failures where a syntactically valid patch executes its logic at the wrong point in the program's runtime lifecycle. Models frequently...
- [15] P1: Implicit Rules and Knowledge Boundaries. Task Features: These tasks are characterized by information vacuums. They depend on domain-specific rules (e.g., cryptographic protocols, complex formatting specifications) absent from both the codebase and the issue description. Insight: Textual hints are insufficient. Models cannot reason their way through missi...
- [16] P2: Textual Distractors and Alignment Bias. Task Features: These tasks contain surface-level seductions, which refer to highly visible but incorrect solution hints within bug reports or legacy TODO comments. Insight: Models exhibit a strong alignment bias, prioritizing explicit human suggestions over implicit system logic. Actionable Strategy: Implement a log...
- [17] L1: Structural Dispersal and Boundary Traps. Task Features: These tasks involve scattered impact zones where logic is spread across disconnected modules (e.g., base components and isolated backends). Insight: A narrow problem description acts as a boundary trap, causing the agent to focus on a local visible fix while ignoring identical defects in distant d...
- [18] S1: Functional Symmetry and State Propagation. Task Features: Defined by implicit contracts, these tasks involve paired operations where a change in one side demands a reciprocal update. Failures often manifest as isolated crashes far from the root cause. Insight: Agents tend to apply superficial patches at the crash site rather than maintaining the integrity...
- [19] S2: Foundational Coupling and Ripple Effects. Task Features: These tasks touch the architectural bedrock, which refers to low-level parsers or query builders with massive downstream dependencies. Insight: The risk of regression is extreme because the core changes affect hundreds of modules. Actionable Strategy: For core-component modifications, agents should prio...
- [20] SWE-bench: Can Language Models Resolve Real-World GitHub Issues? V1&V2: The Oracle-Specification Gap. Task Features: A profound mismatch between flexible user expectations and inflexible test oracles. Tests often demand absolute equality of internal types or string formats that are never specified in the task. Insight: These are artificial failures, and the repair is functionally correct but violates a hidden, arbitrary design...
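The oracle-specification gap can be made concrete with a toy example. Both oracles and the patched value below are invented for illustration; they are not from the paper's benchmark tasks:

```python
# Hypothetical illustration of the oracle-specification gap: the harness
# oracle pins an exact internal type, while the issue text only constrains
# the value the fix must produce.
def harness_oracle(result):
    # Strict: demands a built-in int, a constraint never stated in the issue
    return type(result) is int and result == 42

def semantic_oracle(result):
    # Flexible: accepts any numerically equal result
    return result == 42

patched_value = 42.0          # functionally correct, "wrong" internal type
harness_verdict = harness_oracle(patched_value)    # artificial failure
semantic_verdict = semantic_oracle(patched_value)  # issue is actually resolved
```

This is the shape of the misjudgments the paper attributes to evaluation harnesses: the patch satisfies the issue semantically but trips a hidden equality constraint in the test oracle.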