Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
Pith reviewed 2026-05-15 01:48 UTC · model grok-4.3
The pith
Retriever components, especially the algorithm, often influence RAG performance for software engineering tasks more than the generator model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The empirical study isolates the effects of query processing, retrieval models (including BM25), context refinement, and generators across three SE tasks. It finds that the choice of retrieval algorithm frequently has a larger impact on system performance than the choice of generator model, with the lexical retriever BM25 performing robustly across tasks.
What carries the argument
The component-wise isolation and evaluation of RAG pipeline elements, with special focus on the retrieval algorithm's role in determining overall performance.
If this is right
- Optimizing retrieval algorithms can provide greater performance improvements than changing the generator model.
- BM25 serves as a reliable and effective retrieval method for various software engineering RAG applications.
- System builders should prioritize retrieval-side enhancements when developing RAG for code-related tasks.
Where Pith is reading between the lines
- Lexical retrieval like BM25 may excel in code tasks because exact matches to identifiers and syntax are critical.
- These findings could extend to other retrieval-heavy domains beyond software engineering.
- Developers might achieve better results by combining strong retrievers with simpler generators to reduce costs.
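The BM25 scoring that these points lean on can be sketched in a few lines. The snippet below is a minimal, from-scratch illustration (not the paper's implementation): it scores tokenized code snippets against a query by exact token overlap, which is why identifier matches weigh so heavily. The corpus and token lists are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    docs: list of token lists (e.g. code split on non-alphanumeric characters).
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the corpus.
    df = {t: sum(1 for d in docs if t in d) for t in set(query_tokens)}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue  # exact-match only: no credit for near-miss identifiers
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores

# Hypothetical code corpus: only the first snippet shares identifiers
# with the query, so it is the only one that scores above zero.
corpus = [
    ["def", "parse_config", "path", "open", "path"],
    ["def", "train_model", "epochs", "optimizer"],
]
print(bm25_scores(["parse_config", "path"], corpus))
```

Because the identifier `parse_config` appears verbatim only in the first snippet, it dominates the ranking — the mechanism the review conjectures makes lexical retrieval strong on code.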
Load-bearing premise
That the results from the three specific SE tasks and chosen models and datasets will hold for other software engineering problems and real-world codebases.
What would settle it
Running the same component comparisons on additional SE tasks, such as code summarization for larger projects or in different programming languages, and checking whether the retriever still dominates performance.
Original abstract
While Retrieval-Augmented Generation (RAG) is increasingly adopted to ground Large Language Models (LLMs) in software artifacts, the optimal configuration of its components remains an open question for software engineering (SE) tasks. The lack of systematic guidance forces practitioners into costly, ad-hoc experimentation. This paper presents a comprehensive, component-wise empirical study that dissects the RAG pipeline, evaluating over 21 distinct models and methods. Our study systematically isolates and evaluates 4 query processing techniques, 7 retrieval models spanning sparse, dense, and hybrid paradigms, 4 context refinement methods, and 6 distinct generators. We test these components on a suite of 3 core SE tasks: code generation, summarization, and repair. Our empirical findings reveal a crucial insight: the retriever-side components, particularly the choice of the retrieval algorithm, often exert a more significant influence on final system performance than the selection of the generator model. Strikingly, the classic lexical retriever BM25 demonstrates exceptionally robust performance across diverse tasks. Our analysis provides a practical, data-driven roadmap for researchers and practitioners, offering clear guidance on prioritizing optimization efforts when constructing effective RAG systems for software engineering contexts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a component-wise empirical study of Retrieval-Augmented Generation (RAG) pipelines for three software engineering tasks (code generation, summarization, and repair). It isolates and evaluates 4 query processing techniques, 7 retrieval models (sparse, dense, and hybrid), 4 context refinement methods, and 6 generators, reporting that retriever-side components—particularly the choice of retrieval algorithm—exert greater influence on final performance than generator selection, with the classic BM25 retriever showing robust results across tasks.
Significance. If the comparative influence findings hold under controlled analysis, the work supplies actionable, data-driven guidance for SE practitioners constructing RAG systems and highlights that retrieval choices may warrant higher priority than generator upgrades. The emphasis on BM25's consistent performance offers a concrete, low-cost baseline that could reduce reliance on expensive neural retrievers in code-related applications.
major comments (2)
- [Results and Analysis] The central claim that retriever-side components exert more influence than generator selection requires matched ablations: performance deltas (or ranges) across the 7 retrieval models for each fixed generator must be directly compared against deltas across the 6 generators for each fixed retriever (e.g., via max or average spread, or factorial ANOVA). The abstract and results presentation do not report such effect-size comparisons, so the 'more significant' assertion rests on an unverified assumption about comparable magnitudes.
- [Experimental Setup] No details are supplied on statistical testing, variance or standard deviation across runs, confidence intervals, or exact dataset sizes and splits. This absence makes it impossible to determine whether observed component rankings and task differences are reliable or could be artifacts of single-run noise or metric scaling.
minor comments (1)
- [Abstract] The abstract states 'over 21 distinct models and methods' while the component counts sum exactly to 21; confirm that the full text consistently reports the total number of unique RAG configurations actually evaluated rather than the sum of component options.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor.
Point-by-point responses
-
Referee: [Results and Analysis] The central claim that retriever-side components exert more influence than generator selection requires matched ablations: performance deltas (or ranges) across the 7 retrieval models for each fixed generator must be directly compared against deltas across the 6 generators for each fixed retriever (e.g., via max or average spread, or factorial ANOVA). The abstract and results presentation do not report such effect-size comparisons, so the 'more significant' assertion rests on an unverified assumption about comparable magnitudes.
Authors: We agree that quantifying the relative influence through matched effect-size comparisons would strengthen the central claim. In the revised manuscript, we will add a dedicated analysis computing performance deltas (max-min ranges and average spreads) across the 7 retrieval models for each fixed generator, and directly compare these to the deltas across the 6 generators for each fixed retriever. These results will be presented in an additional table or figure in the results section, with discussion of the magnitudes. We will also update the abstract to reference the quantified comparison. This revision directly addresses the concern. revision: yes
-
Referee: [Experimental Setup] No details are supplied on statistical testing, variance or standard deviation across runs, confidence intervals, or exact dataset sizes and splits. This absence makes it impossible to determine whether observed component rankings and task differences are reliable or could be artifacts of single-run noise or metric scaling.
Authors: We acknowledge that these experimental details were omitted. In the revision, we will explicitly report the exact dataset sizes and splits used for each of the three tasks (code generation, summarization, and repair). However, all experiments were conducted as single runs per configuration to manage the substantial computational cost of the full combinatorial evaluation. As a result, we do not have variance, standard deviations, or confidence intervals from multiple runs, and cannot add statistical testing without new experiments. We will state this limitation clearly in the experimental setup section and discuss its implications for interpreting the rankings. revision: partial
- Remaining limitation: the absence of variance, standard deviations, confidence intervals, and formal statistical testing, which cannot be added without re-running the full set of experiments with multiple random seeds.
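The matched-ablation analysis the referee asks for reduces to comparing spreads along the two axes of a retriever × generator results grid. The sketch below uses an illustrative grid (the scores and component names are hypothetical, not the paper's data) and computes the max-min range along each axis, the simplest of the effect-size comparisons the report suggests:

```python
# Hypothetical results[retriever][generator] -> task metric (illustrative only).
results = {
    "BM25":   {"gen_a": 0.42, "gen_b": 0.45, "gen_c": 0.44},
    "dense":  {"gen_a": 0.31, "gen_b": 0.38, "gen_c": 0.35},
    "hybrid": {"gen_a": 0.40, "gen_b": 0.43, "gen_c": 0.41},
}

retrievers = list(results)
generators = list(next(iter(results.values())))

# Spread across retrievers with each generator held fixed (retriever influence).
retriever_spreads = [
    max(results[r][g] for r in retrievers) - min(results[r][g] for r in retrievers)
    for g in generators
]
# Spread across generators with each retriever held fixed (generator influence).
generator_spreads = [
    max(results[r][g] for g in generators) - min(results[r][g] for g in generators)
    for r in retrievers
]

avg_retriever_effect = sum(retriever_spreads) / len(retriever_spreads)
avg_generator_effect = sum(generator_spreads) / len(generator_spreads)
print(avg_retriever_effect, avg_generator_effect)
```

In this toy grid the retriever axis shows roughly twice the average spread of the generator axis, which is the shape of evidence that would support the paper's central claim; a factorial ANOVA over the same grid would add a significance test on top of the raw spreads.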
Circularity Check
No circularity: purely empirical component comparison
Full rationale
The paper performs a direct empirical ablation across 4 query processors, 7 retrievers, 4 refiners, and 6 generators on three fixed SE tasks using standard metrics. No equations, fitted parameters, or predictions appear; all reported influences are measured performance deltas on external datasets. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling exist. The retriever-vs-generator claim rests on observed spreads rather than any reduction to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The three selected SE tasks and the chosen models/datasets are sufficiently representative to support general recommendations about component importance.