
arxiv: 2606.30963 · v1 · pith:DOJN7YHFnew · submitted 2026-06-29 · 💻 cs.SE · cs.AI

Loc2Repair: A Framework for Evaluating the Impact of File-Level Issue Localization in Repo-Level LLM Repair

Pith reviewed 2026-07-01 01:02 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords: file-level localization, repo-level repair, LLM repair, evaluation framework, SWE-bench Verified, automated program repair, modular pipeline, issue localization

The pith

Explicit file-level localization raises the pooled resolved rate in repository-level LLM repair from 44.7% to 48.9% and 49.1% with predicted file sets, and to 52.4% with gold file sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Loc2Repair, a modular evaluation framework that decouples file-level issue localization from the repair process in repository-grounded automated repair. It uses this to test the impact of localization by comparing baseline repair without explicit localization to repair guided by predicted or gold file locations across three repair backbones on SWE-bench Verified. The results show consistent improvements in resolved rates and reductions in mean elapsed time when localization is provided. This allows researchers to analyze distinct failure modes in end-to-end repair pipelines under controlled conditions.

Core claim

Loc2Repair decouples localization and repair under a shared runtime, artifact schema, and evaluation harness, allowing researchers to combine different localization models and repair backbones under matched conditions. Using three repair backbones on SWE-bench Verified, we compare baseline repair without explicit localization, repair guided by predicted localization from two localizers, and repair guided by gold modified-file sets. Explicit localization consistently improves resolved rate across all backbones: pooled performance increases from 44.7% for baseline repair to 48.9% and 49.1% with predicted localization, and to 52.4% with gold localization. Localization also reduces mean elapsed time overall.

What carries the argument

The Loc2Repair framework that isolates file-level issue localization as an upstream variable by decoupling it from repair under shared conditions.
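
The decoupling described above can be pictured with a minimal sketch. This is not the paper's code: the names `Issue`, `Localizer`, `RepairBackbone`, and `run_condition` are hypothetical, and the only idea taken from the source is that the same repair backbone is run under three conditions (no file hints, predicted file hints, gold file hints) through one shared harness.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Issue:
    """Hypothetical record for one benchmark instance."""
    instance_id: str
    text: str
    gold_files: tuple[str, ...]  # files touched by the reference patch


# Hypothetical signatures: a localizer proposes file paths for an issue,
# and a repair backbone reports whether the issue was resolved.
Localizer = Callable[[Issue], list[str]]
RepairBackbone = Callable[[Issue, Optional[list[str]]], bool]


def run_condition(issues, repair, localizer=None, use_gold=False):
    """Return the resolved rate for one (backbone, condition) cell."""
    resolved = 0
    for issue in issues:
        if use_gold:
            hints = list(issue.gold_files)   # gold modified-file set
        elif localizer is not None:
            hints = localizer(issue)         # predicted localization
        else:
            hints = None                     # baseline: no explicit hints
        resolved += bool(repair(issue, hints))
    return resolved / len(issues)
```

Because the localizer is just an argument, swapping in a different localization model changes one call site while the backbone and the scoring loop stay fixed, which is the sense in which the conditions are matched.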

If this is right

  • Resolved rates increase with both predicted and gold localization across all tested backbones.
  • Gold localization achieves the highest pooled resolved rate of 52.4%.
  • Mean elapsed time decreases with localization guidance in paired analysis.
  • Token effects remain heterogeneous across models despite overall latency improvements.
  • Gold-guided failures reveal remaining headroom beyond localization.
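
The pooled figures can be turned into percentage-point gains with a few lines of arithmetic. The sketch below uses only the four pooled rates quoted from the abstract; the labels `pred_1` and `pred_2` are placeholders, since the abstract does not say which localizer produced which rate, and the "share of gold gain" ratio is a derived illustration rather than a number the paper reports.

```python
# Pooled resolved rates (%) as quoted in the abstract.
pooled = {"baseline": 44.7, "pred_1": 48.9, "pred_2": 49.1, "gold": 52.4}


def gain(condition):
    """Percentage-point gain over baseline, rounded to one decimal."""
    return round(pooled[condition] - pooled["baseline"], 1)


def share_of_gold_gain(condition):
    """Fraction of the gold-localization gain reached by a condition."""
    return round(gain(condition) / gain("gold"), 3)


for name in ("pred_1", "pred_2", "gold"):
    print(name, gain(name), share_of_gold_gain(name))
```

On these rounded inputs the predicted localizers gain 4.2 and 4.4 points against 7.7 for gold, so each recovers a little over half of the gold gain. The Figure 1 caption reports paired estimates of +4.3 and +4.5 points; the 0.1-point gap is plausibly a rounding effect, but that is an assumption.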

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could explore combining multiple localizers to approach gold performance more closely.
  • The modular design makes it straightforward to swap in new localization methods for testing.
  • Similar decoupling might reveal localization benefits in other software engineering tasks involving LLMs.
  • The time savings could make repair systems more practical for large repositories if the pattern holds.

Load-bearing premise

The three repair backbones and SWE-bench Verified dataset are representative enough for the observed localization benefit to apply more broadly.

What would settle it

Observing no improvement or a drop in resolved rates when adding explicit localization in experiments with new backbones or datasets would falsify the claim that localization is a consistent repair lever.

Figures

Figures reproduced from arXiv: 2606.30963 by Mohammad Nour Al Awad, Sergey Ivanov.

Figure 1: Resolved rate versus average elapsed time by repair backbone. Upper-left is better; arrows indicate the effectiveness–latency shift under localization. Pooled paired tests show the same direction, where baseline → Pred-Qwen4B yields +4.3 points (95% CI [+1.9, +6.7], p = 0.0006365); baseline → Pred-Gemma4E4B yields +4.5 points (95% CI [+2.1, +6.8], p = 0.0002982); baseline → gold yields +7.7 points (95% CI …
Original abstract

Repository-grounded automated repair is often reported as a single end-to-end capability, which hides distinct failure modes such as poor file targeting, incorrect patch synthesis, and failed iterative debugging. We present Loc2Repair, a modular evaluation framework for controlled analysis of repository-grounded repair pipelines, and use it to isolate file-level issue localization as an upstream variable. Loc2Repair decouples localization and repair under a shared runtime, artifact schema, and evaluation harness, allowing researchers to combine different localization models and repair backbones under matched conditions. Using three repair backbones on SWE-bench Verified, we compare baseline repair without explicit localization, repair guided by predicted localization from two localizers, and repair guided by gold modified-file sets. Explicit localization consistently improves resolved rate across all backbones: pooled performance increases from 44.7% for baseline repair to 48.9% and 49.1% with predicted localization, and to 52.4% with gold localization. Localization also reduces mean elapsed time overall: in pooled paired analysis, mean elapsed time decreases by 100.94 s and 52.25 s for the two predicted-localization settings, and by 154.45 s with gold guidance, although token effects remain heterogeneous across models. Overall, Loc2Repair shows file-level localization is a consistent repair lever, improving effectiveness and mean latency in pooled analysis, while gold-guided failures expose headroom beyond localization.
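
The abstract and the Figure 1 caption refer to pooled paired analysis with confidence intervals and p-values, but the text reproduced here does not say which procedures were used. The sketch below shows one common way to analyse paired binary outcomes, assuming a percentile bootstrap for the interval and an exact McNemar-style test on discordant pairs; both choices and all function names are assumptions, not the authors' method.

```python
import random
from math import comb


def paired_diff(base, treat):
    """Mean per-instance difference in resolved (1) vs unresolved (0)."""
    return sum(t - b for b, t in zip(base, treat)) / len(base)


def bootstrap_ci(base, treat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling instances so pairs stay intact."""
    rng = random.Random(seed)
    n = len(base)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(treat[i] - base[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def mcnemar_exact_p(base, treat):
    """Two-sided exact test using only instances whose outcome changed."""
    gained = sum(1 for b, t in zip(base, treat) if t and not b)
    lost = sum(1 for b, t in zip(base, treat) if b and not t)
    n = gained + lost
    if n == 0:
        return 1.0
    k = min(gained, lost)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Here `base` and `treat` are equal-length lists of 0/1 outcomes for the same instances under the baseline and a localization condition. Resampling whole instances rather than individual outcomes is what makes the analysis paired.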

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper introduces Loc2Repair, a modular framework that decouples file-level issue localization from repair in repository-grounded LLM repair pipelines under shared runtime, schema, and harness. On SWE-bench Verified with three repair backbones, it compares baseline repair (no explicit localization) against repair guided by two predicted localizers and by gold modified-file sets. Pooled results show resolved-rate gains from 44.7% (baseline) to 48.9%/49.1% (predicted) and 52.4% (gold), with corresponding mean elapsed-time reductions of 100.94 s, 52.25 s, and 154.45 s; gold guidance is used to expose remaining headroom.

Significance. If the empirical comparisons hold, the work demonstrates that explicit file-level localization is a consistent, actionable lever for both effectiveness and mean latency in repo-level repair. The modular decoupling under matched conditions supplies a reusable experimental scaffold for the community; the multi-backbone design and gold-localization upper bound provide concrete, falsifiable evidence rather than end-to-end black-box claims.

minor comments (1)
  1. [Evaluation] The abstract states that token effects remain heterogeneous across models; a brief per-backbone breakdown of token counts (or a supplementary table) would clarify whether the latency gains are driven primarily by fewer repair iterations or by localization overhead.
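
A per-backbone breakdown of the kind the referee asks for could be computed along these lines. The record fields (`backbone`, `condition`, `tokens`, `elapsed_s`) are hypothetical and do not reflect the actual Loc2Repair artifact schema; the sketch only illustrates grouping runs by backbone and condition before averaging.

```python
from collections import defaultdict
from statistics import mean


def breakdown(runs):
    """Average tokens and elapsed time per (backbone, condition) group."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run["backbone"], run["condition"])].append(run)
    return {
        key: {
            "n": len(rows),
            "mean_tokens": mean(r["tokens"] for r in rows),
            "mean_elapsed_s": mean(r["elapsed_s"] for r in rows),
        }
        for key, rows in groups.items()
    }
```

Comparing `mean_tokens` and `mean_elapsed_s` side by side for each backbone would show whether a latency gain coincides with lower token use or comes despite higher token use.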

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance. The review accurately captures the core contribution of Loc2Repair as a modular framework for isolating the effects of file-level localization under controlled conditions.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical evaluation framework (Loc2Repair) that decouples localization from repair and measures outcomes via controlled experiments on the external SWE-bench Verified benchmark across three independent repair backbones. All reported improvements (resolved rates from 44.7% baseline to 48.9/49.1% predicted and 52.4% gold; latency reductions) are direct measured results from these runs under matched runtime and harness conditions. No equations, fitted parameters, self-citations, or ansatzes appear in the derivation chain; the claims do not reduce to inputs by construction and remain self-contained against the external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the SWE-bench Verified benchmark and the assumption that the modular framework does not introduce confounding artifacts; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: SWE-bench Verified is an appropriate benchmark for evaluating repository-level repair performance.
    The experiments are conducted on this dataset to measure resolved rates and time.

pith-pipeline@v0.9.1-grok · 5790 in / 1264 out tokens · 50899 ms · 2026-07-01T01:02:07.582196+00:00 · methodology

