
arxiv: 2605.17965 · v1 · pith:BSKQGRZ4new · submitted 2026-05-18 · 💻 cs.SE · cs.AI

BLAgent: Agentic RAG for File-Level Bug Localization

Pith reviewed 2026-05-20 09:27 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords bug localization · agentic RAG · file-level · SWE-bench Lite · automated program repair · large language models · code chunking · software maintenance

The pith

BLAgent's agentic RAG localizes bugs to the right file with over 78% Top-1 accuracy on SWE-bench Lite using open-source models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BLAgent as a way to identify which file in a code repository contains a bug, a step that often limits progress on fixing code automatically or analyzing root causes. Current retrieval methods for this task are static and do not reason enough to pick the faulty file reliably. BLAgent adds three pieces: it chunks the repository while keeping path and AST structure, rewrites the bug description to pull both structural and runtime signals, and reranks a small list of candidate files first by rules then by step-by-step evidence. If these pieces work together, they let large language models ground their answers in the right code without scanning everything or running up high costs.

Core claim

BLAgent integrates code structure-aware repository encoding with path-augmented AST-based chunking, dual-perspective query transformation capturing both structural and behavioral signals, and two-phase agentic reranking combining symbolic inspection with evidence-grounded reasoning to perform accurate file-level bug localization over a compact candidate set.

What carries the argument

The agentic RAG framework with path-augmented AST chunking for repository encoding, dual-perspective query transformation, and two-phase symbolic-plus-reasoning reranking that balances accuracy and cost through bounded reasoning.
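As a rough illustration of what path-augmented AST chunking could look like, the sketch below splits a Python file at top-level function and class boundaries and prefixes each chunk with its file path. The function name `path_augmented_chunks` and the `# path:` / `# symbol:` header format are assumptions made for this example; the paper's actual chunk format and granularity are not described on this page.

```python
import ast


def path_augmented_chunks(source: str, file_path: str) -> list[str]:
    """Split Python source at top-level def/class boundaries.

    Each chunk is prefixed with the file path and symbol name
    (a hypothetical header format, not the paper's).
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        # Keep whole syntactic units instead of cutting at fixed character counts.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            chunks.append(f"# path: {file_path}\n# symbol: {node.name}\n{segment}")
    return chunks


if __name__ == "__main__":
    demo = "import os\n\ndef f(x):\n    return x + 1\n\nclass A:\n    def g(self):\n        return 2\n"
    for chunk in path_augmented_chunks(demo, "pkg/mod.py"):
        print(chunk, end="\n\n")
```

Because every chunk carries its path, a query that mentions a module or directory name can match the chunk even when the code body itself does not repeat that name.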

If this is right

  • BLAgent reaches over 78% top-1 accuracy with open-source models on SWE-bench Lite.
  • Accuracy exceeds 86% when a closed-source model is used instead.
  • The method is more than 18 times cheaper than the strongest baseline that uses the same model.
  • Plugging BLAgent into an automated program repair pipeline raises the final repair success rate by over 20%.
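For readers unfamiliar with the metric behind these figures, Top-k accuracy for file-level localization can be computed as in the sketch below. It assumes one gold (buggy) file per bug report; the data in the usage example is invented for illustration and is not taken from the paper.

```python
def top_k_accuracy(rankings: dict[str, list[str]], gold: dict[str, str], k: int) -> float:
    """Fraction of bugs whose gold file appears among the first k ranked files.

    rankings maps a bug id to its ranked candidate files (best first);
    gold maps the same bug id to the single file that was actually fixed.
    """
    hits = sum(1 for bug_id, ranked in rankings.items() if gold[bug_id] in ranked[:k])
    return hits / len(rankings)


if __name__ == "__main__":
    # Invented example data: two bug reports with two ranked candidates each.
    rankings = {"bug-1": ["a.py", "b.py"], "bug-2": ["c.py", "d.py"]}
    gold = {"bug-1": "a.py", "bug-2": "d.py"}
    print(top_k_accuracy(rankings, gold, k=1))
    print(top_k_accuracy(rankings, gold, k=2))
```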

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bounded-reasoning pattern could reduce wasted context in other code-search tasks such as finding functions to edit during refactoring.
  • Cost reductions of this size might let teams run file-level checks on every commit rather than only on reported bugs.
  • If the dual-perspective rewrite proves robust, it could be reused as a lightweight add-on for any retrieval system that needs both static and dynamic cues.

Load-bearing premise

The three components of path-augmented AST chunking, dual-perspective query transformation, and two-phase reranking together produce accurate reasoning over a compact set of files that works across benchmarks and models.

What would settle it

Top-1 accuracy falling well below 50% on a fresh set of bug reports from different repositories or languages would show that the components do not generalize as claimed.

Figures

Figures reproduced from arXiv: 2605.17965 by Gias Uddin, Md Afif Al Mamun.

Figure 1: An example demonstrating how incorrect file localization may lead to incorrect patch generation.
Figure 2: Overall outline of the proposed localization approach.
Figure 3: Comparison of naive text-based versus AST-aware code splitting. The naive splitter (a) breaks the …
Figure 4: Query transformation of human-reported bug (Example Bug: …
Figure 5: Reranking of the candidate files with ReAct agent.
Figure 6: Basic RAG pipeline for file-level localization.
Figure 7: Two cases illustrating how path-augmented code chunking improves retrieval similarity.
Figure 8: File-level localization in dense retrieval when the correct file appears in the Top-1,3,10 locations.
Figure 9: Example of different query transformations and retrieved files.
Figure 10: Integration of BLAgent into another APR framework.
Figure 11: Overlap of repaired issues across multiple runs using different localization strategies.
Figure 12: Overall resolution and failure percentage at different levels of program repair stage.
Figure 13: Comparison of (a) APR-generated incorrect patch, and (b) Ground-truth patch for …
Figure 14: Example of failed line level localization ( …
Figure 15: Generated patch with correct line level information.
Original abstract

Bug localization remains a key bottleneck in downstream software maintenance tasks, including root cause analysis, triage, and automated program repair (APR), despite recent advances in large language model (LLM)-based repair systems. File-level bug localization is especially critical in hierarchical pipelines, where errors can propagate to downstream stages such as statement-level localization or patch generation. While Retrieval-Augmented Generation (RAG) offers a promising direction for grounding LLMs in repository context, existing RAG pipelines rely on static retrieval and lack the reasoning needed to identify faulty code accurately. In this work, we present BLAgent, a novel agentic RAG framework for file-level bug localization that integrates three key ideas: (i) code structure-aware repository encoding with path-augmented AST-based chunking, (ii) dual-perspective query transformation capturing both structural and behavioral signals, and (iii) two-phase agentic reranking combining symbolic inspection with evidence-grounded reasoning. Unlike prior graph-based or multi-hop agentic approaches, BLAgent performs bounded reasoning over a compact candidate set, balancing accuracy and cost. On SWE-bench Lite, BLAgent attains over 78% Top-1 accuracy with open-source models and over 86% with a closed-source model, while being over 18x cheaper than the strongest baseline using the same model. When integrated into an APR framework, it improves end-to-end repair success by over 20%.
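One way to read "bounded reasoning over a compact candidate set" is that a cheap rule-based pass orders all candidates and an expensive reasoning step runs only on a fixed number of them. The sketch below illustrates that reading. The identifier-overlap rule, the `budget` parameter, and the `reason` callback (a stand-in for an LLM agent) are all assumptions made for this example and are not the paper's implementation.

```python
import re
from typing import Callable


def symbolic_score(report: str, file_text: str) -> int:
    """Phase 1 (assumed rule): count report identifiers that also occur in the file."""
    identifiers = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]{2,}", report))
    file_tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", file_text))
    return len(identifiers & file_tokens)


def two_phase_rerank(
    report: str,
    candidates: dict[str, str],
    reason: Callable[[str, str, str], float],
    budget: int = 3,
) -> list[str]:
    """Rank candidate files; call the expensive `reason` scorer at most `budget` times."""
    # Phase 1: cheap symbolic ordering over every candidate.
    phase1 = sorted(
        candidates, key=lambda p: symbolic_score(report, candidates[p]), reverse=True
    )
    head, tail = phase1[:budget], phase1[budget:]
    # Phase 2: expensive scoring restricted to the head, which bounds the cost.
    head = sorted(head, key=lambda p: reason(report, p, candidates[p]), reverse=True)
    return head + tail
```

With this shape, the number of expensive calls depends only on `budget`, not on repository size, which is one concrete sense in which the reasoning cost is bounded.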

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces BLAgent, an agentic RAG framework for file-level bug localization in software repositories. It proposes three components: (i) path-augmented AST-based chunking for code structure-aware encoding, (ii) dual-perspective query transformation for structural and behavioral signals, and (iii) two-phase agentic reranking with symbolic inspection and evidence-grounded reasoning. The central empirical claims are that BLAgent achieves over 78% Top-1 accuracy on SWE-bench Lite using open-source models and over 86% with a closed-source model, is more than 18x cheaper than the strongest baseline with the same model, and yields over 20% improvement in end-to-end repair success when integrated into an APR pipeline.

Significance. If the reported performance gains and cost reductions are shown to be robust and attributable to the proposed mechanisms, the work would represent a meaningful advance in repository-scale bug localization. It could improve the reliability of hierarchical software maintenance pipelines and APR systems by providing a more accurate and efficient way to ground LLMs in repository context without unbounded reasoning costs.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experimental Evaluation): The manuscript reports strong Top-1 accuracy, cost reduction, and APR improvement figures but provides no ablation studies that isolate the individual contributions of path-augmented AST chunking, dual-perspective query transformation, and two-phase reranking. Without these results it is impossible to determine whether the claimed gains are caused by the agentic RAG design or by properties of the base models, prompt engineering, or the SWE-bench Lite bug distribution.
  2. [§4.1] §4.1 (Baselines and Metrics): The abstract states that BLAgent is over 18x cheaper than the strongest baseline using the same model, yet the paper supplies no description of the baseline systems, their retrieval mechanisms, or the exact cost metric (token usage, API calls, or wall-clock time). This omission prevents verification of the cost claim and its load-bearing role in the central argument.
  3. [§4.3] §4.3 (Generalization): No results are presented on repositories or benchmarks outside SWE-bench Lite. The claim that the three components produce bounded, accurate reasoning over a compact candidate set therefore rests on a single benchmark whose bug distribution may not be representative, weakening the assertion that the framework generalizes.
minor comments (2)
  1. [Abstract] The abstract and introduction use the term 'bounded reasoning' without a precise definition or complexity bound; a short paragraph clarifying what 'bounded' means in terms of candidate set size or reasoning steps would improve clarity.
  2. [Figures and Tables] Figure captions and table headers should explicitly state the models (open-source vs. closed-source) and the exact Top-1 accuracy numbers rather than relying on the prose description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment point by point below, indicating where we will revise the manuscript to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract and §4] The manuscript reports strong Top-1 accuracy, cost reduction, and APR improvement figures but provides no ablation studies that isolate the individual contributions of path-augmented AST chunking, dual-perspective query transformation, and two-phase reranking. Without these results it is impossible to determine whether the claimed gains are caused by the agentic RAG design or by properties of the base models, prompt engineering, or the SWE-bench Lite bug distribution.

    Authors: We agree that explicit ablation studies are required to attribute performance gains to the proposed mechanisms. In the revised manuscript we will add a new subsection in §4 that reports results after systematically ablating each component in turn (path-augmented AST chunking, dual-perspective query transformation, and two-phase reranking) while keeping all other factors fixed. These experiments will quantify the contribution of each element to Top-1 accuracy and cost. revision: yes

  2. Referee: [§4.1] The abstract states that BLAgent is over 18x cheaper than the strongest baseline using the same model, yet the paper supplies no description of the baseline systems, their retrieval mechanisms, or the exact cost metric (token usage, API calls, or wall-clock time). This omission prevents verification of the cost claim and its load-bearing role in the central argument.

    Authors: We accept that the current description of baselines and cost measurement is insufficient. We will expand §4.1 with full specifications of every baseline, including their retrieval strategies and implementation details, and will state explicitly that cost is measured as total input plus output tokens across all LLM calls (retrieval and generation) using the same model for fair comparison. This will allow direct verification of the 18x reduction. revision: yes

  3. Referee: [§4.3] No results are presented on repositories or benchmarks outside SWE-bench Lite. The claim that the three components produce bounded, accurate reasoning over a compact candidate set therefore rests on a single benchmark whose bug distribution may not be representative, weakening the assertion that the framework generalizes.

    Authors: SWE-bench Lite is the current standard benchmark for repository-level bug localization because it consists of real GitHub issues with full repository context. Nevertheless, we acknowledge that results on additional benchmarks would strengthen the generalization argument. In the revision we will add a dedicated paragraph in §4.3 discussing the representativeness of SWE-bench Lite and will include a limitations subsection that explicitly notes the single-benchmark scope and outlines plans for future multi-benchmark evaluation. revision: partial
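The cost metric described in response 2 (total input plus output tokens across all LLM calls, with the same model on both sides) can be sketched as follows. The `LLMCall` record and the token counts in the usage example are hypothetical and serve only to show the arithmetic; they are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class LLMCall:
    """Token usage of a single LLM call (hypothetical record type)."""
    input_tokens: int
    output_tokens: int


def total_tokens(calls: list[LLMCall]) -> int:
    """Sum input and output tokens over all calls made for one bug."""
    return sum(c.input_tokens + c.output_tokens for c in calls)


def cost_ratio(baseline_calls: list[LLMCall], method_calls: list[LLMCall]) -> float:
    """How many times more tokens the baseline uses than the method."""
    return total_tokens(baseline_calls) / total_tokens(method_calls)


if __name__ == "__main__":
    # Invented numbers for illustration only.
    baseline = [LLMCall(9000, 1000)]
    method = [LLMCall(400, 100)]
    print(cost_ratio(baseline, method))
```

A token-based ratio is only comparable across systems when both use the same model and tokenizer, which is why the response stresses using the same model for the comparison.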

Circularity Check

0 steps flagged

No circularity: empirical performance claims on external benchmark

Full rationale

The paper presents BLAgent as a novel agentic RAG framework incorporating path-augmented AST chunking, dual-perspective query transformation, and two-phase reranking, then reports empirical results on SWE-bench Lite (over 78% Top-1 accuracy with open-source models, over 86% with a closed-source model, over 18x lower cost than the strongest baseline using the same model, and over 20% improvement in end-to-end APR success). No mathematical derivations, equations, fitted parameters, or self-referential definitions appear in the provided text. The performance figures are tied directly to an external benchmark rather than to any internal reduction or self-citation chain, so the central claims are self-contained empirical observations without circular structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions about LLM reasoning capabilities over code and the representativeness of SWE-bench Lite; no free parameters, new invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)
  • domain assumption: Large language models can perform reliable symbolic inspection and evidence-grounded reasoning when given compact, well-structured code context.
    The two-phase agentic reranking step depends on this capability.

pith-pipeline@v0.9.0 · 5784 in / 1317 out tokens · 29549 ms · 2026-05-20T09:27:18.545462+00:00 · methodology

