Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
Pith reviewed 2026-05-15 01:48 UTC · model grok-4.3
The pith
Retriever components, especially the algorithm, often influence RAG performance for software engineering tasks more than the generator model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The empirical study isolates the effects of query processing, retrieval models (including BM25), context refinement, and generators across three SE tasks. It finds that the choice of retrieval algorithm frequently has a larger impact on system performance than the choice of generator model, with the lexical retriever BM25 performing robustly across tasks.
What carries the argument
The component-wise isolation and evaluation of RAG pipeline elements, with special focus on the retrieval algorithm's role in determining overall performance.
If this is right
- Optimizing retrieval algorithms can provide greater performance improvements than changing the generator model.
- BM25 serves as a reliable and effective retrieval method for various software engineering RAG applications.
- System builders should prioritize retrieval-side enhancements when developing RAG for code-related tasks.
Where Pith is reading between the lines
- Lexical retrieval like BM25 may excel in code tasks because exact matches to identifiers and syntax are critical.
- These findings could extend to other retrieval-heavy domains beyond software engineering.
- Developers might achieve better results by combining strong retrievers with simpler generators to reduce costs.
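The BM25 scoring that these points lean on can be sketched in a few lines. The snippet below is a minimal, from-scratch illustration (not the paper's implementation): it scores tokenized code snippets against a query by exact token overlap, which is why identifier matches weigh so heavily. The corpus and token lists are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    docs: list of token lists (e.g. code split on non-alphanumeric characters).
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the corpus.
    df = {t: sum(1 for d in docs if t in d) for t in set(query_tokens)}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue  # exact-match only: no credit for near-miss identifiers
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores

# Hypothetical code corpus: only the first snippet shares identifiers
# with the query, so it is the only one that scores above zero.
corpus = [
    ["def", "parse_config", "path", "open", "path"],
    ["def", "train_model", "epochs", "optimizer"],
]
print(bm25_scores(["parse_config", "path"], corpus))
```

Because the identifier `parse_config` appears verbatim only in the first snippet, it dominates the ranking — the mechanism the review conjectures makes lexical retrieval strong on code.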
Load-bearing premise
That the results from the three specific SE tasks and chosen models and datasets will hold for other software engineering problems and real-world codebases.
What would settle it
Running the same component comparisons on additional SE tasks, such as code summarization for larger projects or in different programming languages, and checking whether the retriever still dominates performance.
Original abstract
While Retrieval-Augmented Generation (RAG) is increasingly adopted to ground Large Language Models (LLMs) in software artifacts, the optimal configuration of its components remains an open question for software engineering (SE) tasks. The lack of systematic guidance forces practitioners into costly, ad-hoc experimentation. This paper presents a comprehensive, component-wise empirical study that dissects the RAG pipeline, evaluating over 21 distinct models and methods. Our study systematically isolates and evaluates 4 query processing techniques, 7 retrieval models spanning sparse, dense, and hybrid paradigms, 4 context refinement methods, and 6 distinct generators. We test these components on a suite of 3 core SE tasks: code generation, summarization, and repair. Our empirical findings reveal a crucial insight: the retriever-side components, particularly the choice of the retrieval algorithm, often exert a more significant influence on final system performance than the selection of the generator model. Strikingly, the classic lexical retriever BM25 demonstrates exceptionally robust performance across diverse tasks. Our analysis provides a practical, data-driven roadmap for researchers and practitioners, offering clear guidance on prioritizing optimization efforts when constructing effective RAG systems for software engineering contexts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a component-wise empirical study of Retrieval-Augmented Generation (RAG) pipelines for three software engineering tasks (code generation, summarization, and repair). It isolates and evaluates 4 query processing techniques, 7 retrieval models (sparse, dense, and hybrid), 4 context refinement methods, and 6 generators, reporting that retriever-side components—particularly the choice of retrieval algorithm—exert greater influence on final performance than generator selection, with the classic BM25 retriever showing robust results across tasks.
Significance. If the comparative influence findings hold under controlled analysis, the work supplies actionable, data-driven guidance for SE practitioners constructing RAG systems and highlights that retrieval choices may warrant higher priority than generator upgrades. The emphasis on BM25's consistent performance offers a concrete, low-cost baseline that could reduce reliance on expensive neural retrievers in code-related applications.
major comments (2)
- [Results and Analysis] The central claim that retriever-side components exert more influence than generator selection requires matched ablations: performance deltas (or ranges) across the 7 retrieval models for each fixed generator must be directly compared against deltas across the 6 generators for each fixed retriever (e.g., via max or average spread, or factorial ANOVA). The abstract and results presentation do not report such effect-size comparisons, so the 'more significant' assertion rests on an unverified assumption about comparable magnitudes.
- [Experimental Setup] No details are supplied on statistical testing, variance or standard deviation across runs, confidence intervals, or exact dataset sizes and splits. This absence makes it impossible to determine whether observed component rankings and task differences are reliable or could be artifacts of single-run noise or metric scaling.
minor comments (1)
- [Abstract] The abstract states 'over 21 distinct models and methods' while the component counts sum exactly to 21; confirm that the full text consistently reports the total number of unique RAG configurations actually evaluated rather than the sum of component options.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor.
Point-by-point responses
-
Referee: [Results and Analysis] The central claim that retriever-side components exert more influence than generator selection requires matched ablations: performance deltas (or ranges) across the 7 retrieval models for each fixed generator must be directly compared against deltas across the 6 generators for each fixed retriever (e.g., via max or average spread, or factorial ANOVA). The abstract and results presentation do not report such effect-size comparisons, so the 'more significant' assertion rests on an unverified assumption about comparable magnitudes.
Authors: We agree that quantifying the relative influence through matched effect-size comparisons would strengthen the central claim. In the revised manuscript, we will add a dedicated analysis computing performance deltas (max-min ranges and average spreads) across the 7 retrieval models for each fixed generator, and directly compare these to the deltas across the 6 generators for each fixed retriever. These results will be presented in an additional table or figure in the results section, with discussion of the magnitudes. We will also update the abstract to reference the quantified comparison. This revision directly addresses the concern. revision: yes
-
Referee: [Experimental Setup] No details are supplied on statistical testing, variance or standard deviation across runs, confidence intervals, or exact dataset sizes and splits. This absence makes it impossible to determine whether observed component rankings and task differences are reliable or could be artifacts of single-run noise or metric scaling.
Authors: We acknowledge that these experimental details were omitted. In the revision, we will explicitly report the exact dataset sizes and splits used for each of the three tasks (code generation, summarization, and repair). However, all experiments were conducted as single runs per configuration to manage the substantial computational cost of the full combinatorial evaluation. As a result, we do not have variance, standard deviations, or confidence intervals from multiple runs, and cannot add statistical testing without new experiments. We will state this limitation clearly in the experimental setup section and discuss its implications for interpreting the rankings. revision: partial
- Remaining limitation: the absence of variance, standard deviations, confidence intervals, and formal statistical testing, which cannot be added without re-running the full set of experiments with multiple random seeds.
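The matched-ablation analysis the referee asks for reduces to comparing spreads along the two axes of a retriever × generator results grid. The sketch below uses an illustrative grid (the scores and component names are hypothetical, not the paper's data) and computes the max-min range along each axis, the simplest of the effect-size comparisons the report suggests:

```python
# Hypothetical results[retriever][generator] -> task metric (illustrative only).
results = {
    "BM25":   {"gen_a": 0.42, "gen_b": 0.45, "gen_c": 0.44},
    "dense":  {"gen_a": 0.31, "gen_b": 0.38, "gen_c": 0.35},
    "hybrid": {"gen_a": 0.40, "gen_b": 0.43, "gen_c": 0.41},
}

retrievers = list(results)
generators = list(next(iter(results.values())))

# Spread across retrievers with each generator held fixed (retriever influence).
retriever_spreads = [
    max(results[r][g] for r in retrievers) - min(results[r][g] for r in retrievers)
    for g in generators
]
# Spread across generators with each retriever held fixed (generator influence).
generator_spreads = [
    max(results[r][g] for g in generators) - min(results[r][g] for g in generators)
    for r in retrievers
]

avg_retriever_effect = sum(retriever_spreads) / len(retriever_spreads)
avg_generator_effect = sum(generator_spreads) / len(generator_spreads)
print(avg_retriever_effect, avg_generator_effect)
```

In this toy grid the retriever axis shows roughly twice the average spread of the generator axis, which is the shape of evidence that would support the paper's central claim; a factorial ANOVA over the same grid would add a significance test on top of the raw spreads.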
Circularity Check
No circularity: purely empirical component comparison
Full rationale
The paper performs a direct empirical ablation across 4 query processors, 7 retrievers, 4 refiners, and 6 generators on three fixed SE tasks using standard metrics. No equations, fitted parameters, or predictions appear; all reported influences are measured performance deltas on external datasets. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling exist. The retriever-vs-generator claim rests on observed spreads rather than any reduction to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The three selected SE tasks and the chosen models/datasets are sufficiently representative to support general recommendations about component importance.