Beyond Relevance: Utility-Centric Retrieval in the LLM Era
Pith reviewed 2026-05-10 17:27 UTC · model grok-4.3
The pith
Retrieval for large language models must optimize for contribution to answer quality rather than topical relevance alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Information retrieval systems have traditionally optimized for topical relevance, yet in retrieval-augmented generation the retrieved documents act as evidence for LLMs that produce answers, so effectiveness must instead be evaluated by contribution to generation quality. Retrieval objectives are therefore evolving from relevance-centric optimization toward LLM-centric utility, organized through a unified framework covering LLM-agnostic versus LLM-specific utility, context-independent versus context-dependent utility, and connections with LLM information needs and agentic RAG.
What carries the argument
Unified framework distinguishing LLM-agnostic versus LLM-specific utility and context-independent versus context-dependent utility in support of generation quality.
Load-bearing premise
Utility for LLMs can be defined and measured separately from relevance, and this distinction fundamentally alters the retrieval paradigm without requiring additional empirical validation of the framework.
What would settle it
A head-to-head experiment in which retrieval systems tuned for the proposed utility measures produce lower-quality LLM answers than systems tuned only for traditional relevance metrics on the same queries and tasks.
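Such a comparison can be sketched in a few lines. Everything below is a toy stand-in for illustration, not the tutorial's own method: `relevance_score` is a term-overlap proxy for topical relevance, `utility_score` is a marginal-gain proxy for LLM-centric utility, and `answer_quality` is a hypothetical judge supplied by the caller (in a real study it would wrap an actual LLM and evaluation metric).

```python
def relevance_score(query, doc):
    # Toy topical relevance: fraction of query terms appearing in the document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def utility_score(query, doc, answer_quality):
    # Toy LLM-centric utility: how much the document improves the
    # answer-quality judge when added to an empty context.
    return answer_quality(query, [doc]) - answer_quality(query, [])

def rank(docs, key):
    return sorted(docs, key=key, reverse=True)

def compare_rankings(query, docs, answer_quality, k=2):
    """Answer quality of the top-k documents under each ranking objective."""
    by_rel = rank(docs, key=lambda d: relevance_score(query, d))[:k]
    by_util = rank(docs, key=lambda d: utility_score(query, d, answer_quality))[:k]
    return answer_quality(query, by_rel), answer_quality(query, by_util)
```

If the utility-tuned ranking yields lower answer quality than the relevance-tuned one across tasks, the paper's central reorientation would be undermined; the sketch only fixes the shape of that test, not its outcome.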
read the original abstract
Information retrieval systems have traditionally optimized for topical relevance: the degree to which retrieved documents match a query. However, relevance only approximates a deeper goal: utility, namely, whether retrieved information helps accomplish a user's underlying task. The emergence of retrieval-augmented generation (RAG) fundamentally changes this paradigm. Retrieved documents are no longer consumed directly by users but instead serve as evidence for large language models (LLMs) that produce answers. As a result, retrieval effectiveness must be evaluated by its contribution to generation quality rather than by relevance-based ranking metrics alone. This tutorial argues that retrieval objectives are evolving from relevance-centric optimization toward LLM-centric utility. We present a unified framework covering LLM-agnostic versus LLM-specific utility, context-independent versus context-dependent utility, and the connection with LLM information needs and agentic RAG. By synthesizing recent advances, the tutorial provides conceptual foundations and practical guidance for designing retrieval systems aligned with the requirements of LLM-based information access.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that traditional IR systems optimize for topical relevance, but the rise of RAG with LLMs requires a shift to utility-centric retrieval, where effectiveness is judged by contribution to downstream generation quality rather than relevance metrics. It presents a unified framework distinguishing LLM-agnostic vs. LLM-specific utility and context-independent vs. context-dependent utility, while connecting these to LLM information needs and agentic RAG, and synthesizes recent advances to offer conceptual foundations and practical guidance for LLM-aligned retrieval design.
Significance. If the conceptual reorientation holds, the tutorial provides a useful organizing lens for the field by synthesizing advances in utility definitions and RAG-specific retrieval, offering practical guidance that could influence system design as generative AI becomes dominant. The absence of new derivations, proofs, or empirical results means its value lies in synthesis rather than novel testable claims.
major comments (2)
- [§3] §3 (Framework): The central distinction between LLM-agnostic and LLM-specific utility is presented conceptually but lacks operational definitions or examples of how these would alter retrieval objectives or ranking functions compared to standard relevance; this weakens the claim that the paradigm fundamentally changes without additional validation mechanisms.
- [§4] §4 (Connection to agentic RAG): The argument that retrieval must support agentic workflows relies on cited prior work but does not demonstrate through any concrete case how utility metrics would be computed or optimized differently from relevance in multi-step agent scenarios, leaving the practical guidance underspecified.
minor comments (2)
- [Abstract] The abstract and introduction repeat the core claim about relevance approximating utility without citing a specific prior result or example that illustrates the approximation gap.
- [§3] Notation for utility types (e.g., U_LLM-agnostic) is introduced but not consistently used in later sections when discussing measurement.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for minor revision. We appreciate the recognition of the tutorial's value as a synthesis and will use the feedback to strengthen the practical aspects of the framework and guidance.
read point-by-point responses
-
Referee: [§3] §3 (Framework): The central distinction between LLM-agnostic and LLM-specific utility is presented conceptually but lacks operational definitions or examples of how these would alter retrieval objectives or ranking functions compared to standard relevance; this weakens the claim that the paradigm fundamentally changes without additional validation mechanisms.
Authors: We agree that additional operational detail would improve clarity. In the revised manuscript, we will expand §3 to include explicit operational definitions and illustrative examples. For LLM-agnostic utility, we will describe metrics such as document coverage or information gain that apply across models; for LLM-specific utility, we will show incorporation of model-dependent signals like predicted generation quality or token-level utility. We will also provide concrete examples of how these alter ranking objectives, such as replacing or augmenting relevance scores with utility predictors in a re-ranking step, drawing directly from synthesized prior work. This will demonstrate the shift without introducing new empirical claims. revision: yes
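The re-ranking change the authors describe can be illustrated with a minimal sketch. The convex mix, the weight `alpha`, and the `predicted_utility` signal (standing in for any LLM-specific estimator such as predicted generation quality) are assumptions for illustration, not the tutorial's prescribed formula.

```python
def rerank(candidates, alpha=0.5):
    """Re-rank candidates by mixing relevance with a predicted utility signal.

    candidates: list of (doc_id, relevance, predicted_utility) tuples,
    with both scores assumed to be on comparable scales.
    Returns doc_ids ordered by alpha * relevance + (1 - alpha) * utility,
    so alpha=1.0 recovers pure relevance ranking and alpha=0.0 pure utility.
    """
    def mixed(c):
        _, rel, util = c
        return alpha * rel + (1 - alpha) * util
    return [doc_id for doc_id, _, _ in sorted(candidates, key=mixed, reverse=True)]
```

The point of the sketch is the shift in objective, not the linear form: replacing `mixed` with a learned utility predictor changes what the ranker optimizes without touching the rest of the pipeline.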
-
Referee: [§4] §4 (Connection to agentic RAG): The argument that retrieval must support agentic workflows relies on cited prior work but does not demonstrate through any concrete case how utility metrics would be computed or optimized differently from relevance in multi-step agent scenarios, leaving the practical guidance underspecified.
Authors: We acknowledge that the guidance in §4 would be strengthened by greater specificity. We will revise the section to include a detailed concrete case study of a multi-step agentic scenario (e.g., iterative multi-hop question answering). The example will illustrate computation of context-dependent utility at each step—based on the agent's intermediate state and projected impact on final output quality—versus static relevance, along with optimization approaches such as utility-guided retrieval planning. This will be supported by references to existing methods while remaining within the tutorial's synthetic scope. revision: yes
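The step-wise, context-dependent utility described in this response can be sketched as a greedy retrieval loop: each document's score is its marginal contribution given the evidence already gathered, rather than a static relevance score. The `quality` judge is a hypothetical placeholder (a real system would use a learned estimator of final answer quality); the greedy selection and stopping rule are illustrative assumptions.

```python
def step_utility(doc, context, query, quality):
    # Context-dependent utility: marginal gain of adding doc to the
    # evidence accumulated so far, as scored by the quality judge.
    return quality(query, context + [doc]) - quality(query, context)

def utility_guided_retrieval(query, corpus, quality, max_steps=3):
    """Greedily gather evidence, one document per step, by marginal utility."""
    context = []
    for _ in range(max_steps):
        remaining = [d for d in corpus if d not in context]
        if not remaining:
            break
        best = max(remaining,
                   key=lambda d: step_utility(d, context, query, quality))
        if step_utility(best, context, query, quality) <= 0:
            break  # nothing left adds information; stop retrieving
        context.append(best)
    return context
```

Under static relevance, a duplicate of an already-retrieved hop can outrank the missing hop; under the marginal-gain score above its utility drops to zero once the first copy is in context, which is exactly the behavior the multi-hop case study is meant to exhibit.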
Circularity Check
No significant circularity; conceptual tutorial without derivation chain
full rationale
The paper is a tutorial synthesizing prior literature on retrieval for RAG, presenting a conceptual reorientation toward utility-centric evaluation. No equations, formal proofs, fitted parameters, or new empirical results are introduced that could reduce to self-definitions or self-citations. The central claim is supported by external citations rather than internal reductions, and the framework functions as an organizing lens with independent content from prior advances. This is self-contained against external benchmarks with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Utility is best measured by downstream contribution to LLM generation quality rather than by topical match
Reference graph
Works this paper leans on
- [1] Harry W Bruce. 1994. A cognitive view of the situational dynamism of user-centered relevance estimation. JASIST 45, 3 (1994), 142–148.
- [2] Georg Buscher, Ludger Van Elst, and Andreas Dengel. 2009. Segment-level display time as implicit feedback: a comparison to eye tracking. In SIGIR'09. 67–74.
- [3] Lorenzo Canale, Stefano Scotta, Alberto Messina, and Laura Farinetti. 2025. BES4RAG: A Framework for Embedding Model Selection in Retrieval-Augmented Generation. In CLiC-it 2025. CEUR Workshop, Cagliari, Italy, 134–142.
- [4] William S Cooper. 1971. A definition of relevance for information retrieval. Information Storage and Retrieval 7, 1 (1971), 19–37.
- [5] Lu Dai, Yijie Xu, Jinhui Ye, Hao Liu, and Hui Xiong. 2025. Seper: Measure retrieval utility through the lens of semantic perplexity reduction. ICLR'26 (2025).
- [6] Xinyi Dai, Jiawei Hou, Qing Liu, Yunjia Xi, Ruiming Tang, Weinan Zhang, Xiuqiang He, Jun Wang, and Yong Yu. 2020. U-rank: Utility-oriented learning to rank with implicit feedback. In CIKM'20. 2373–2380.
- [8] Jingsheng Gao, Linxu Li, Weiyuan Li, Yuzhuo Fu, and Bin Dai. 2025. Smartrag: Jointly learn rag-related tasks from the environment feedback. ICLR'25 (2025).
- [9] Xinyu Gao, Yun Xiong, Deze Wang, Zhenhan Guan, Zejian Shi, Haofen Wang, and Shanshan Li. 2024. Preference-Guided Refactored Tuning for Retrieval Augmented Code Generation. In ASE'24. 65–77.
- [11] Xuming Hu, Zhaochen Hong, Zhijiang Guo, Lijie Wen, and Philip Yu. 2023. Read it twice: Towards faithfully interpretable fact verification by revisiting evidence. In SIGIR'23. 2319–2323.
- [12] Gautier Izacard and Edouard Grave. 2020. Distilling knowledge from reader to retriever for question answering. ICLR'21 (2020).
- [13] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: few-shot learning with retrieval augmented language models. J. Mach. Learn. Res. 24, 1, Article 251 (Jan. 2023), 43 pages.
- [15] Akriti Jain and Aparna Garimella. 2025. Modeling Contextual Passage Utility for Multihop Question Answering. In IJCNLP-AACL'25. 464–471.
- [16] Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong Park. 2024. Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. In NAACL'24. Association for Computational Linguistics, Mexico City, Mexico, 7036–7050.
- [17] Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation. In EMNLP'23. 7969–7992.
- [18] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516 (2025).
- [19] Seikyung Jung, Jonathan L Herlocker, and Janet Webster. 2007. Click data as implicit relevance feedback in web search. Information Processing & Management 43, 3 (2007), 791–807.
- [20] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In EMNLP'20. 6769–6781.
- [21] Zixuan Ke, Weize Kong, Cheng Li, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky. 2024. Bridging the preference gap between retrievers and llms. In ACL'24. 10438–10451.
- [22] Diane Kelly and Nicholas J Belkin. 2004. Display time as implicit feedback: understanding task effects. In SIGIR'04. 377–384.
- [23] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. NeurIPS'20 33 (2020), 9459–9474.
- [25] Hongyu Lu, Min Zhang, and Shaoping Ma. 2018. Between clicks and satisfaction: Study on multi-phase user preferences and satisfaction for online news reading. In SIGIR'18. 435–444.
- [26] Cheng Luo, Yiqun Liu, Tetsuya Sakai, Ke Zhou, Fan Zhang, Xue Li, and Shaoping Ma. 2017. Does document relevance affect the searcher's perception of time?. In WSDM'17. 141–150.
- [27] Stephen E Robertson. 1977. The probability ranking principle in IR. Journal of Documentation 33, 4 (1977), 294–304.
- [28] Tefko Saracevic. 1975. Relevance: A review of and a framework for the thinking on the notion in information science. JASIST 26, 6 (1975), 321–343.
- [29] Tefko Saracevic. 1996. Relevance reconsidered. In Proceedings of the second conference on conceptions of library and information science (CoLIS 2). 201–218.
- [30] Tefko Saracevic, Paul Kantor, Alice Y Chamis, and Donna Trivison. 1988. A study of information seeking and retrieving. I. Background and methodology. JASIST 39, 3 (1988), 161–176.
- [31] Alfred Schutz and Lester Embree. 2011. Reflections on the Problem of Relevance. Springer.
- [32] Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy. In Findings of the EMNLP 2023. 9248–9274.
- [33] Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2024. REPLUG: Retrieval-Augmented Black-Box Language Models. In NAACL'24. 8371–8384.
- [34] Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu. 2024. Dragin: Dynamic retrieval augmented generation based on the real-time information needs of large language models. In ACL'24. 12991–13013.
- [35] Yue Wang, Dawei Yin, Luo Jie, Pengyuan Wang, Makoto Yamada, Yi Chang, and Qiaozhu Mei. 2016. Beyond ranking: Optimizing whole-page presentation. In WSDM'16. 103–112.
- [36] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP'18. 2369–2380.
- [37] Feng Yu, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. 2016. A dynamic recurrent model for next basket recommendation. In SIGIR'16. 729–732.
- [39] Hengran Zhang, Keping Bi, Jiafeng Guo, and Xueqi Cheng. 2026. An Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy. In Findings of the ACL 2026.
- [41] Hengran Zhang, Keping Bi, Jiafeng Guo, Jiaming Zhang, Shuaiqiang Wang, Dawei Yin, and Xueqi Cheng. 2025. Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation. In SIGIR-AP'25. 22–30.
- [43] Hengran Zhang, Minghao Tang, Keping Bi, Jiafeng Guo, Shihao Liu, Daiting Shi, Dawei Yin, and Xueqi Cheng. 2025. Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation. In EMNLP'25. 1683–1702.
- [44] Hengran Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, and Xueqi Cheng. 2023. From relevance to utility: Evidence retrieval with feedback for fact verification. In Findings of the EMNLP 2023. 6373–6384.
- [45] Hengran Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, and Xueqi Cheng. 2024. Are Large Language Models Good at Utility Judgments?. In SIGIR'24. 1941–1951.
- [46] Xinping Zhao, Dongfang Li, Yan Zhong, Boren Hu, Yibin Chen, Baotian Hu, and Min Zhang. 2024. Seer: Self-aligned evidence extraction for retrieval-augmented generation. EMNLP'24 (2024).
- [48] Xiaofei Zhu, Jiafeng Guo, Xueqi Cheng, and Yanyan Lan. 2012. More than relevance: high utility query recommendation by mining users' search behaviors. In CIKM'12. 1814–1818.