Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization
Pith reviewed 2026-05-11 02:54 UTC · model grok-4.3
The pith
SPARK retrieves similar past test failures to annotate suspicious lines and guide LLMs toward more accurate fault locations in new failing tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SPARK integrates accumulated debugging knowledge from CI environments into LLM-based TCFL by retrieving similar fault-labeled test cases from a knowledge corpus and selectively annotating suspicious lines of the failing test based on their similarity to previously observed fault patterns. These annotations guide the LLM's reasoning while maintaining scalability and avoiding prompt-length explosion. On three industrial datasets of real-world faulty Python test cases, SPARK identifies more correct faulty locations than the existing LLM-based baseline, particularly in complex multi-fault cases, while keeping inference cost and token usage comparable.
What carries the argument
The selective annotation step, which transfers fault labels from retrieved similar cases onto suspicious lines in the target failing test to focus the LLM's attention.
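The paper (as summarized here) does not define the similarity metric or the annotation procedure. A minimal sketch of what line-level transfer could look like, assuming a simple string-similarity ratio and threshold as stand-ins for the paper's unspecified pattern-similarity metric; `annotate_suspicious_lines` and the `fault_lines` field are illustrative names, not the authors' API:

```python
from difflib import SequenceMatcher

def annotate_suspicious_lines(failing_test, retrieved_cases, threshold=0.8):
    """Mark lines of a failing test that resemble fault-labeled lines from
    retrieved similar cases. The similarity metric (SequenceMatcher ratio)
    and the threshold are illustrative assumptions, not SPARK's definitions."""
    annotated = []
    for line in failing_test.splitlines():
        suspicious = any(
            SequenceMatcher(None, line.strip(), fault_line.strip()).ratio() >= threshold
            for case in retrieved_cases
            for fault_line in case["fault_lines"]
        )
        # Annotate in place rather than appending retrieved cases to the prompt,
        # so prompt growth stays bounded by the length of the failing test.
        annotated.append(("# SUSPICIOUS: " if suspicious else "") + line)
    return "\n".join(annotated)
```

Annotating each line independently, as above, is one way the framework could surface multiple faults in a single test without requiring a whole-test match.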
If this is right
- More correct faulty locations are identified in complex tests that contain multiple faults.
- Fault localization effectiveness rises while inference cost and token usage stay comparable to the unaugmented baseline.
- The approach scales to large test suites without causing prompt-length problems that plague naive retrieval methods.
- It works on real industrial Python test cases drawn from different software products.
Where Pith is reading between the lines
- The same retrieval-and-annotation idea could be tested on non-Python test languages if equivalent fault-labeled corpora are collected.
- Combining the annotations with additional signals such as execution traces might further narrow the search space in black-box settings.
- Maintaining an evolving CI corpus could allow the system to improve automatically as new faults are discovered and labeled.
Load-bearing premise
That cases retrieved from the CI corpus will share fault patterns accurate enough to annotate the new test without adding misleading noise that hurts the LLM's reasoning.
What would settle it
Run the same three industrial datasets through the baseline LLM approach but with the annotation step removed or replaced by random line marks, then check whether the number of correctly localized faults drops, especially on the multi-fault subset.
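The random-mark control proposed above could be implemented as a simple harness; `random_mark_control` is a hypothetical helper, and the marker string is assumed to match whatever annotation format the real pipeline uses:

```python
import random

def random_mark_control(failing_test, n_marks, seed=0):
    """Control condition: annotate n_marks uniformly random lines instead of
    retrieval-selected ones. If SPARK's gains come from the retrieved fault
    patterns rather than from highlighting per se, localization accuracy
    should drop under this control, especially on multi-fault tests."""
    rng = random.Random(seed)  # fixed seed for a reproducible control run
    lines = failing_test.splitlines()
    marked = set(rng.sample(range(len(lines)), min(n_marks, len(lines))))
    return "\n".join(
        ("# SUSPICIOUS: " + ln) if i in marked else ln
        for i, ln in enumerate(lines)
    )
```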
Original abstract
Software failures remain a major challenge in modern software development, and identifying the code elements responsible for failures is a time-consuming debugging task. While extensive research has focused on fault localization in the system under test (SUT), failures can also originate from faulty system test scripts. This problem, known as Test Code Fault Localization (TCFL), has received significantly less attention despite its importance in continuous integration (CI) environments where large test suites are executed frequently. TCFL is particularly challenging because it typically operates under black-box conditions, relies on limited diagnostic signals such as error messages and partial logs, and involves large system-level test scripts that expand the fault localization search space. In this paper, we propose SPARK, a framework that integrates accumulated debugging knowledge from continuous integration (CI) environments into Large Language Model (LLM)-based TCFL. Given a newly observed failing test case, SPARK retrieves similar fault-labeled test cases from a debugging knowledge corpus and selectively annotates suspicious lines of the failing test based on their similarity to previously observed fault patterns. These annotations guide the LLM's reasoning while maintaining scalability and avoiding the prompt-length explosion common to naive retrieval-augmented approaches. We evaluate SPARK on three industrial datasets containing real-world faulty Python test cases from different software products. The results show that SPARK consistently improves fault localization effectiveness compared to the existing LLM-based TCFL baseline while maintaining comparable inference cost and token usage. In particular, the approach advances the state of the art by identifying more correct faulty locations in complex test cases containing multiple faults.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SPARK, a retrieval-augmented framework for LLM-based Test Code Fault Localization (TCFL). Given a failing test, SPARK retrieves similar fault-labeled cases from a CI debugging knowledge corpus and selectively annotates suspicious lines based on pattern similarity; these annotations are then provided to the LLM to improve localization. The approach is evaluated on three industrial datasets of real-world faulty Python test cases, with the central claim that SPARK yields consistent gains in fault-localization effectiveness over an existing LLM-based TCFL baseline, especially on complex multi-fault tests, while preserving comparable inference cost and token usage.
Significance. If the reported gains prove robust, SPARK would represent a practical advance in an under-studied area of test-script debugging within CI pipelines. By leveraging historical fault patterns without naive retrieval-induced prompt bloat, the method could improve LLM reasoning on large system-level tests where diagnostic signals are limited.
Major comments (2)
- [Method / SPARK Framework] The method section provides no formal definition of the similarity metric, no pseudocode for the retrieval or line-selection procedure, and no explicit handling of partial matches when a test contains multiple independent faults. This directly underpins the central claim that retrieved annotations improve rather than degrade LLM output; without these details it is impossible to assess whether surface-level similarity (e.g., token overlap or error strings) reliably identifies causal fault locations.
- [Evaluation / Results] The evaluation reports “consistent improvement” and “more correct faulty locations in complex test cases containing multiple faults,” yet supplies no numerical metrics (e.g., Top-1/Top-5 accuracy, EXAM score), no statistical significance tests, no ablation on the annotation component, and no breakdown by number of faults per test. These omissions make it impossible to verify the robustness of the multi-fault claim or to compare effect sizes against the baseline.
Minor comments (2)
- [Abstract] The abstract states that annotations “guide the LLM’s reasoning while maintaining scalability,” but the paper never quantifies prompt-length growth or token usage beyond the qualitative claim of “comparable” cost.
- [Preliminaries / Notation] Notation for the knowledge corpus, similarity function, and annotation mask is introduced without a dedicated notation table or consistent symbols across figures and text.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important opportunities to improve clarity in the method description and rigor in the evaluation. We address each major comment below and will incorporate the suggested changes in the revised manuscript.
Point-by-point responses
-
Referee: [Method / SPARK Framework] The method section provides no formal definition of the similarity metric, no pseudocode for the retrieval or line-selection procedure, and no explicit handling of partial matches when a test contains multiple independent faults. This directly underpins the central claim that retrieved annotations improve rather than degrade LLM output; without these details it is impossible to assess whether surface-level similarity (e.g., token overlap or error strings) reliably identifies causal fault locations.
Authors: We agree that a formal definition of the similarity metric, pseudocode, and explicit discussion of multi-fault handling would strengthen the method section and aid reproducibility. In the revised manuscript we will add a formal definition of the similarity metric (based on pattern similarity between the current failing test and historical fault-labeled cases), include pseudocode for the retrieval and selective annotation steps, and explain how partial matches are managed: each line is annotated independently according to its similarity to observed fault patterns, enabling the framework to surface multiple faults without requiring an exact overall test match. These additions will directly support the claim that the annotations improve LLM localization. Revision: yes.
-
Referee: [Evaluation / Results] The evaluation reports “consistent improvement” and “more correct faulty locations in complex test cases containing multiple faults,” yet supplies no numerical metrics (e.g., Top-1/Top-5 accuracy, EXAM score), no statistical significance tests, no ablation on the annotation component, and no breakdown by number of faults per test. These omissions make it impossible to verify the robustness of the multi-fault claim or to compare effect sizes against the baseline.
Authors: We acknowledge that the evaluation would benefit from greater quantitative detail and additional analyses. In the revised manuscript we will expand the results section to report Top-1 and Top-5 accuracy as well as EXAM scores for SPARK versus the baseline on all three datasets, include statistical significance testing, present an ablation study isolating the selective annotation component, and provide a breakdown of performance by number of faults per test (single-fault versus multi-fault cases). These changes will allow readers to verify the reported gains and effect sizes more rigorously. Revision: yes.
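The metrics promised in the rebuttal have standard definitions; a minimal sketch under the usual conventions (Top-k counts a hit if any true faulty line appears among the first k ranked suspects; EXAM is the fraction of ranked locations inspected before the first true fault, lower being better):

```python
def top_k_hit(ranked_lines, true_fault_lines, k):
    """Top-k accuracy for one test: 1 if any true faulty line appears
    among the first k ranked suspects, else 0."""
    return int(any(line in true_fault_lines for line in ranked_lines[:k]))

def exam_score(ranked_lines, true_fault_lines):
    """EXAM score for one test: fraction of ranked locations examined
    before the first true fault is reached (1.0 if never reached)."""
    for rank, line in enumerate(ranked_lines, start=1):
        if line in true_fault_lines:
            return rank / len(ranked_lines)
    return 1.0
```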
Circularity Check
No circularity: the empirical retrieval method relies on an external corpus and direct evaluation.
Full rationale
The paper describes SPARK as a retrieval-augmented framework that pulls similar fault-labeled cases from an external CI debugging corpus, selectively annotates lines in a new failing test, and feeds the result to an LLM for localization. Evaluation is performed on three separate industrial datasets with real faulty Python tests, reporting improvements over an LLM baseline in effectiveness metrics while holding inference cost constant. No equations, fitted parameters, or first-principles derivations appear in the provided text; the central claim is supported by empirical comparison rather than any self-referential definition, uniqueness theorem, or ansatz smuggled via self-citation. The method is therefore self-contained against external benchmarks and receives a score of 0.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: retrieved similar fault-labeled test cases provide useful, non-misleading annotations for guiding LLM fault localization in new tests.