pith. machine review for the scientific record.

arxiv: 2604.22661 · v1 · submitted 2026-04-24 · 💻 cs.IR · cs.CL


Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines


Pith reviewed 2026-05-08 10:16 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords query performance prediction · retrieval-augmented generation · query reformulation · query variant selection · information retrieval · RAG pipelines

The pith

Query performance prediction can select effective variants among reformulated queries to improve RAG output over the original query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether Query Performance Prediction (QPP) can choose the best query variant inside a RAG pipeline before running expensive retrieval and generation steps. It shifts focus from predicting difficulty across different topics to discriminating among several reformulations of one information need. Large-scale experiments on TREC-RAG data with both sparse and dense retrievers show that QPP scores often point to variants that raise final answer quality. Pre-retrieval predictors turn out to be nearly as effective as post-retrieval ones, which is attractive for low-latency use. The work also documents a consistent mismatch: the variants that score highest on ranking metrics frequently do not produce the strongest generated answers.

Core claim

QPP applied to intra-topic variant selection can identify reformulations that raise end-to-end RAG quality above the original query; lightweight pre-retrieval predictors frequently match or exceed the accuracy of more expensive post-retrieval predictors, even while retrieval-optimal variants and generation-optimal variants diverge.

What carries the argument

Intra-topic QPP discrimination, which scores competing reformulations of the same query to predict downstream retrieval and generation quality without executing the full pipeline for each.
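One classic pre-retrieval signal of this kind is the average inverse document frequency of the query terms, which needs only corpus statistics. The sketch below is illustrative rather than one of the paper's predictors; `doc_freq` (term document frequencies) and `n_docs` (collection size) are assumed inputs:

```python
import math

def avg_idf(query, doc_freq, n_docs):
    """Mean inverse document frequency of the query terms: a classic
    pre-retrieval QPP signal (higher = more specific query)."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    return sum(math.log(n_docs / (1 + doc_freq.get(t, 0))) for t in terms) / len(terms)

def select_variant(variants, doc_freq, n_docs):
    """Intra-topic selection: score each reformulation of one
    information need and keep the highest-scoring variant."""
    return max(variants, key=lambda q: avg_idf(q, doc_freq, n_docs))
```

Because nothing here touches the index beyond precomputed statistics, the score costs microseconds per variant, which is what makes pre-retrieval prediction attractive for latency-sensitive selection.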

If this is right

  • RAG pipelines can run full retrieval and generation only on the single variant predicted to be best, cutting compute for the others.
  • Pre-retrieval QPP supplies a low-latency filter that still improves final answers over using the original query.
  • Retrieval metrics such as nDCG are imperfect proxies for generation success, so variant selection must target generation objectives directly.
  • Both sparse and dense retrievers benefit from the same QPP-based selection approach.
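Under these assumptions, selective execution is a thin wrapper around any QPP score: reformulate cheaply, predict cheaply, and pay for retrieval and generation only once. A minimal Python sketch, where all the callables are hypothetical stand-ins rather than interfaces from the paper:

```python
def selective_rag(original, reformulate, predict, retrieve, generate, k=5):
    """Selective execution: generate variants, score them all with a
    cheap QPP predictor, then run the expensive retrieval + generation
    stages on the single top-scoring variant only."""
    variants = [original] + reformulate(original)
    best = max(variants, key=predict)   # cheap per-variant score
    passages = retrieve(best, k=k)      # paid once, not per variant
    return generate(best, passages)
```

The compute saving scales with the number of variants: for v reformulations, retrieval and generation run once instead of v+1 times, at the cost of v+1 predictor calls.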

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection logic could be applied to other multi-prompt LLM workflows where several phrasings of a request are generated before expensive inference.
  • Production systems might combine QPP scores with lightweight online feedback to adapt variant choice per user or topic over time.
  • The documented retrieval-generation gap implies that future RAG training objectives should optimize for answer fidelity rather than ranking metrics alone.

Load-bearing premise

That the observed link between QPP scores and measured RAG answer quality will hold for other collections, retrievers, and generators beyond those tested.

What would settle it

Repeating the experiments on a fresh dataset or with different LLMs and finding that top-scoring QPP variants no longer produce higher answer quality than the original query or lower-scoring variants.

Figures

Figures reproduced from arXiv: 2604.22661 by Andrew Drozdov, Matei Zaharia, Michael Bendersky, Negar Arabzadeh.

Figure 1. Relationship between retrieval effectiveness (nDCG@5) and end-to-end RAG utility (Nugget-All) under sparse and …

Original abstract

Large Language Models (LLMs) have made query reformulation ubiquitous in modern retrieval and Retrieval-Augmented Generation (RAG) pipelines, enabling the generation of multiple semantically equivalent query variants. However, executing the full pipeline for every reformulation is computationally expensive, motivating selective execution: can we identify the best query variant before incurring downstream retrieval and generation costs? We investigate Query Performance Prediction (QPP) as a mechanism for variant selection across ad-hoc retrieval and end-to-end RAG. Unlike traditional QPP, which estimates query difficulty across topics, we study intra-topic discrimination - selecting the optimal reformulation among competing variants of the same information need. Through large-scale experiments on TREC-RAG using both sparse and dense retrievers, we evaluate pre- and post-retrieval predictors under correlation- and decision-based metrics. Our results reveal a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a "utility gap" between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.
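The abstract's two evaluation views can be made concrete: a correlation-based metric asks how well QPP scores rank a topic's variants by measured quality, while a decision-based metric asks only whether the variant the predictor picks beats the original query. A minimal sketch (Kendall's tau-a assuming no tied values; function names are illustrative, not the paper's metrics):

```python
from itertools import combinations

def kendall_tau(scores, quality):
    """Correlation-based view: rank agreement (tau-a, no ties assumed)
    between QPP scores and measured quality across one topic's variants."""
    pairs = list(combinations(zip(scores, quality), 2))
    # +1 for each concordant pair, -1 for each discordant pair
    c_minus_d = sum(1 if (x1 - x2) * (y1 - y2) > 0 else -1
                    for (x1, y1), (x2, y2) in pairs)
    return c_minus_d / len(pairs)

def decision_success(scores, quality, original_idx=0):
    """Decision-based view: does the variant the predictor picks beat
    the original query's measured quality?"""
    picked = max(range(len(scores)), key=scores.__getitem__)
    return quality[picked] > quality[original_idx]
```

The two views can disagree: a predictor with mediocre rank correlation can still make the right top-1 decision, which is why the paper evaluates under both.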

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that Query Performance Prediction (QPP) methods—particularly lightweight pre-retrieval predictors—can reliably select query variants that improve end-to-end RAG quality over the original query on the TREC-RAG collection. Experiments with sparse and dense retrievers show a utility gap between retrieval metrics (e.g., nDCG) and generation quality, yet QPP enables selective execution that reduces latency while maintaining or improving performance under both correlation and decision-based metrics.

Significance. If the empirical findings hold under broader conditions, the work offers a practical, low-overhead technique for robust RAG pipelines by avoiding full execution of all query variants. It also surfaces a systematic mismatch between retrieval and generation objectives that future RAG research must address.

major comments (3)
  1. [§4] §4 (Experiments) and results tables: the claim that QPP 'reliably identify[s] variants that improve end-to-end quality' is not supported by reported statistical significance tests, confidence intervals, or error bars on the decision-based improvements; without these, it is impossible to judge whether observed gains exceed noise.
  2. [§5] §5 (Discussion/Conclusion): the assertion that lightweight pre-retrieval QPP offers a 'latency-efficient approach to robust RAG' extrapolates beyond the tested setting; all results are confined to TREC-RAG with specific retrievers and (unspecified) generators, and no cross-collection or cross-generator validation is performed to test stability of intra-topic discrimination.
  3. [§3] §3 (Methodology): exact decision thresholds used for the 'decision-based metrics' (e.g., when a variant is deemed better than the original) are not stated, nor is the procedure for choosing among multiple variants when QPP scores are tied or close.
minor comments (2)
  1. [Abstract] Abstract: the specific LLM generators used in the RAG pipeline are not named, which is required for reproducibility of the generation-quality results.
  2. [Figures/Tables] Figure captions and tables: axis labels and metric definitions (e.g., exact formulation of the decision-based success rate) should be expanded for readers unfamiliar with the precise QPP-to-RAG mapping.
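The referee's first major comment (missing uncertainty quantification) can be addressed with a standard percentile bootstrap over per-topic gains. A generic sketch, not the authors' evaluation code; `deltas` is assumed to hold the per-topic quality difference between the QPP-selected variant and the original query:

```python
import random

def bootstrap_ci(deltas, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-topic quality gain of a
    QPP-selected variant over the original query; the gain is
    significant at roughly the alpha level when the interval excludes 0."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(deltas, k=len(deltas))) / len(deltas)
        for _ in range(n_resamples)
    )
    return (means[int(alpha / 2 * n_resamples)],
            means[int((1 - alpha / 2) * n_resamples) - 1])
```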

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us identify areas for improvement in clarity and rigor. We address each major comment point by point below and describe the corresponding revisions.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments) and results tables: the claim that QPP 'reliably identify[s] variants that improve end-to-end quality' is not supported by reported statistical significance tests, confidence intervals, or error bars on the decision-based improvements; without these, it is impossible to judge whether observed gains exceed noise.

    Authors: We agree that statistical significance testing and uncertainty quantification are necessary to substantiate the decision-based improvements. In the revised manuscript, we will add bootstrap confidence intervals (with 1000 resamples) for all reported gains in end-to-end RAG quality and apply paired statistical tests (McNemar's test for binary decisions and paired t-tests for continuous metrics) to assess whether improvements over the original query are significant at p<0.05. Error bars will be included in the relevant tables and figures. revision: yes

  2. Referee: [§5] §5 (Discussion/Conclusion): the assertion that lightweight pre-retrieval QPP offers a 'latency-efficient approach to robust RAG' extrapolates beyond the tested setting; all results are confined to TREC-RAG with specific retrievers and (unspecified) generators, and no cross-collection or cross-generator validation is performed to test stability of intra-topic discrimination.

    Authors: We acknowledge that our experiments are limited to the TREC-RAG collection and the specific retrievers and generators described in the paper. We will revise the discussion and conclusion to explicitly qualify our claims, stating that the latency-efficiency benefits and intra-topic discrimination results hold under the tested conditions and that broader validation across collections and generators is needed to confirm stability. We will also specify the exact generators used in the experiments. revision: partial

  3. Referee: [§3] §3 (Methodology): exact decision thresholds used for the 'decision-based metrics' (e.g., when a variant is deemed better than the original) are not stated, nor is the procedure for choosing among multiple variants when QPP scores are tied or close.

    Authors: We appreciate this observation. In the revised methodology section (§3), we will explicitly define the decision thresholds (a variant is deemed better than the original if its QPP score exceeds the original by at least 0.05 in normalized terms) and the tie-breaking rule (select the variant with the highest QPP score; if scores are identical, select randomly). revision: yes
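The decision rule sketched in this rebuttal is simple enough to state as code. The 0.05 margin is the rebuttal's own figure; everything else is an illustrative rendering, not the paper's implementation:

```python
import random

def choose_variant(scores, original_idx=0, margin=0.05, seed=0):
    """Rebuttal's decision rule: prefer a variant only if its normalized
    QPP score beats the original by at least `margin`; break exact ties
    among top-scoring variants uniformly at random."""
    best = max(scores)
    if best < scores[original_idx] + margin:
        return original_idx  # not confidently better: keep the original
    tied = [i for i, s in enumerate(scores) if s == best]
    return random.Random(seed).choice(tied)
```

Keeping the original query as the default when no variant clears the margin is what makes the selection safe: in the worst case the pipeline behaves exactly as it would without QPP.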

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation on external benchmarks

full rationale

The paper reports large-scale experiments on the TREC-RAG collection, measuring correlations and decision-based performance of pre- and post-retrieval QPP predictors for selecting query variants in RAG pipelines. No derivations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. All claims rest on observed results from external data and standard metrics rather than internal definitions or author-prior uniqueness theorems. This is the expected non-finding for an empirical evaluation study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical benchmarking study; it introduces no new free parameters, mathematical axioms, or postulated entities beyond standard IR evaluation practices.

pith-pipeline@v0.9.0 · 5541 in / 1109 out tokens · 102293 ms · 2026-05-08T10:16:16.963524+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

94 extracted references · 20 canonical work pages · 2 internal anchors
