Beyond Relevance: Utility-Centric Retrieval in the LLM Era
Pith reviewed 2026-05-10 17:27 UTC · model grok-4.3
The pith
Retrieval for large language models must optimize for contribution to answer quality rather than topical relevance alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Information retrieval systems have traditionally optimized for topical relevance, yet in retrieval-augmented generation the retrieved documents act as evidence for LLMs that produce answers, so effectiveness must instead be evaluated by contribution to generation quality. Retrieval objectives are therefore evolving from relevance-centric optimization toward LLM-centric utility, organized through a unified framework covering LLM-agnostic versus LLM-specific utility, context-independent versus context-dependent utility, and connections with LLM information needs and agentic RAG.
What carries the argument
Unified framework distinguishing LLM-agnostic versus LLM-specific utility and context-independent versus context-dependent utility in support of generation quality.
Load-bearing premise
Utility for LLMs can be defined and measured separately from relevance, and this distinction fundamentally alters the retrieval paradigm without requiring additional empirical validation of the framework.
What would settle it
A head-to-head experiment in which retrieval systems tuned for the proposed utility measures produce lower-quality LLM answers than systems tuned only for traditional relevance metrics on the same queries and tasks.
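Such a comparison can be sketched in a few lines. Everything below is a toy stand-in for illustration, not the tutorial's own method: `relevance_score` is a term-overlap proxy for topical relevance, `utility_score` is a marginal-gain proxy for LLM-centric utility, and `answer_quality` is a hypothetical judge supplied by the caller (in a real study it would wrap an actual LLM and evaluation metric).

```python
def relevance_score(query, doc):
    # Toy topical relevance: fraction of query terms appearing in the document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def utility_score(query, doc, answer_quality):
    # Toy LLM-centric utility: how much the document improves the
    # answer-quality judge when added to an empty context.
    return answer_quality(query, [doc]) - answer_quality(query, [])

def rank(docs, key):
    return sorted(docs, key=key, reverse=True)

def compare_rankings(query, docs, answer_quality, k=2):
    """Answer quality of the top-k documents under each ranking objective."""
    by_rel = rank(docs, key=lambda d: relevance_score(query, d))[:k]
    by_util = rank(docs, key=lambda d: utility_score(query, d, answer_quality))[:k]
    return answer_quality(query, by_rel), answer_quality(query, by_util)
```

If the utility-tuned ranking yields lower answer quality than the relevance-tuned one across tasks, the paper's central reorientation would be undermined; the sketch only fixes the shape of that test, not its outcome.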
read the original abstract
Information retrieval systems have traditionally optimized for topical relevance: the degree to which retrieved documents match a query. However, relevance only approximates a deeper goal: utility, namely, whether retrieved information helps accomplish a user's underlying task. The emergence of retrieval-augmented generation (RAG) fundamentally changes this paradigm. Retrieved documents are no longer consumed directly by users but instead serve as evidence for large language models (LLMs) that produce answers. As a result, retrieval effectiveness must be evaluated by its contribution to generation quality rather than by relevance-based ranking metrics alone. This tutorial argues that retrieval objectives are evolving from relevance-centric optimization toward LLM-centric utility. We present a unified framework covering LLM-agnostic versus LLM-specific utility, context-independent versus context-dependent utility, and the connection with LLM information needs and agentic RAG. By synthesizing recent advances, the tutorial provides conceptual foundations and practical guidance for designing retrieval systems aligned with the requirements of LLM-based information access.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that traditional IR systems optimize for topical relevance, but the rise of RAG with LLMs requires a shift to utility-centric retrieval, where effectiveness is judged by contribution to downstream generation quality rather than relevance metrics. It presents a unified framework distinguishing LLM-agnostic vs. LLM-specific utility and context-independent vs. context-dependent utility, while connecting these to LLM information needs and agentic RAG, and synthesizes recent advances to offer conceptual foundations and practical guidance for LLM-aligned retrieval design.
Significance. If the conceptual reorientation holds, the tutorial provides a useful organizing lens for the field by synthesizing advances in utility definitions and RAG-specific retrieval, offering practical guidance that could influence system design as generative AI becomes dominant. The absence of new derivations, proofs, or empirical results means its value lies in synthesis rather than novel testable claims.
major comments (2)
- [§3] §3 (Framework): The central distinction between LLM-agnostic and LLM-specific utility is presented conceptually but lacks operational definitions or examples of how these would alter retrieval objectives or ranking functions compared to standard relevance; this weakens the claim that the paradigm fundamentally changes without additional validation mechanisms.
- [§4] §4 (Connection to agentic RAG): The argument that retrieval must support agentic workflows relies on cited prior work but does not demonstrate through any concrete case how utility metrics would be computed or optimized differently from relevance in multi-step agent scenarios, leaving the practical guidance underspecified.
minor comments (2)
- [Abstract] The abstract and introduction repeat the core claim about relevance approximating utility without citing a specific prior result or example that illustrates the approximation gap.
- [§3] Notation for utility types (e.g., U_LLM-agnostic) is introduced but not consistently used in later sections when discussing measurement.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for minor revision. We appreciate the recognition of the tutorial's value as a synthesis and will use the feedback to strengthen the practical aspects of the framework and guidance.
read point-by-point responses
-
Referee: [§3] §3 (Framework): The central distinction between LLM-agnostic and LLM-specific utility is presented conceptually but lacks operational definitions or examples of how these would alter retrieval objectives or ranking functions compared to standard relevance; this weakens the claim that the paradigm fundamentally changes without additional validation mechanisms.
Authors: We agree that additional operational detail would improve clarity. In the revised manuscript, we will expand §3 to include explicit operational definitions and illustrative examples. For LLM-agnostic utility, we will describe metrics such as document coverage or information gain that apply across models; for LLM-specific utility, we will show incorporation of model-dependent signals like predicted generation quality or token-level utility. We will also provide concrete examples of how these alter ranking objectives, such as replacing or augmenting relevance scores with utility predictors in a re-ranking step, drawing directly from synthesized prior work. This will demonstrate the shift without introducing new empirical claims. revision: yes
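The re-ranking change the authors describe can be illustrated with a minimal sketch. The convex mix, the weight `alpha`, and the `predicted_utility` signal (standing in for any LLM-specific estimator such as predicted generation quality) are assumptions for illustration, not the tutorial's prescribed formula.

```python
def rerank(candidates, alpha=0.5):
    """Re-rank candidates by mixing relevance with a predicted utility signal.

    candidates: list of (doc_id, relevance, predicted_utility) tuples,
    with both scores assumed to be on comparable scales.
    Returns doc_ids ordered by alpha * relevance + (1 - alpha) * utility,
    so alpha=1.0 recovers pure relevance ranking and alpha=0.0 pure utility.
    """
    def mixed(c):
        _, rel, util = c
        return alpha * rel + (1 - alpha) * util
    return [doc_id for doc_id, _, _ in sorted(candidates, key=mixed, reverse=True)]
```

The point of the sketch is the shift in objective, not the linear form: replacing `mixed` with a learned utility predictor changes what the ranker optimizes without touching the rest of the pipeline.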
-
Referee: [§4] §4 (Connection to agentic RAG): The argument that retrieval must support agentic workflows relies on cited prior work but does not demonstrate through any concrete case how utility metrics would be computed or optimized differently from relevance in multi-step agent scenarios, leaving the practical guidance underspecified.
Authors: We acknowledge that the guidance in §4 would be strengthened by greater specificity. We will revise the section to include a detailed concrete case study of a multi-step agentic scenario (e.g., iterative multi-hop question answering). The example will illustrate computation of context-dependent utility at each step—based on the agent's intermediate state and projected impact on final output quality—versus static relevance, along with optimization approaches such as utility-guided retrieval planning. This will be supported by references to existing methods while remaining within the tutorial's synthetic scope. revision: yes
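The step-wise, context-dependent utility described in this response can be sketched as a greedy retrieval loop: each document's score is its marginal contribution given the evidence already gathered, rather than a static relevance score. The `quality` judge is a hypothetical placeholder (a real system would use a learned estimator of final answer quality); the greedy selection and stopping rule are illustrative assumptions.

```python
def step_utility(doc, context, query, quality):
    # Context-dependent utility: marginal gain of adding doc to the
    # evidence accumulated so far, as scored by the quality judge.
    return quality(query, context + [doc]) - quality(query, context)

def utility_guided_retrieval(query, corpus, quality, max_steps=3):
    """Greedily gather evidence, one document per step, by marginal utility."""
    context = []
    for _ in range(max_steps):
        remaining = [d for d in corpus if d not in context]
        if not remaining:
            break
        best = max(remaining,
                   key=lambda d: step_utility(d, context, query, quality))
        if step_utility(best, context, query, quality) <= 0:
            break  # nothing left adds information; stop retrieving
        context.append(best)
    return context
```

Under static relevance, a duplicate of an already-retrieved hop can outrank the missing hop; under the marginal-gain score above its utility drops to zero once the first copy is in context, which is exactly the behavior the multi-hop case study is meant to exhibit.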
Circularity Check
No significant circularity; conceptual tutorial without derivation chain
full rationale
The paper is a tutorial synthesizing prior literature on retrieval for RAG, presenting a conceptual reorientation toward utility-centric evaluation. No equations, formal proofs, fitted parameters, or new empirical results are introduced that could reduce to self-definitions or self-citations. The central claim is supported by external citations rather than internal reductions, and the framework functions as an organizing lens with independent content from prior advances. This is self-contained against external benchmarks with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Utility is best measured by downstream contribution to LLM generation quality rather than by topical match
Reference graph
Works this paper leans on
- [1] Harry W Bruce. 1994. A cognitive view of the situational dynamism of user-centered relevance estimation. JASIST 45, 3 (1994), 142–148.
- [2] Georg Buscher, Ludger Van Elst, and Andreas Dengel. 2009. Segment-level display time as implicit feedback: a comparison to eye tracking. In SIGIR'09. 67–74.
- [3] Lorenzo Canale, Stefano Scotta, Alberto Messina, and Laura Farinetti. 2025. BES4RAG: A Framework for Embedding Model Selection in Retrieval-Augmented Generation. In CLiC-it 2025. CEUR Workshop, Cagliari, Italy, 134–142.
- [4] William S Cooper. 1971. A definition of relevance for information retrieval. Information Storage and Retrieval 7, 1 (1971), 19–37.
- [5] Lu Dai, Yijie Xu, Jinhui Ye, Hao Liu, and Hui Xiong. 2025. Seper: Measure retrieval utility through the lens of semantic perplexity reduction. ICLR'26 (2025).
- [6] Xinyi Dai, Jiawei Hou, Qing Liu, Yunjia Xi, Ruiming Tang, Weinan Zhang, Xiuqiang He, Jun Wang, and Yong Yu. 2020. U-rank: Utility-oriented learning to rank with implicit feedback. In CIKM'20. 2373–2380.
- [8] Jingsheng Gao, Linxu Li, Weiyuan Li, Yuzhuo Fu, and Bin Dai. 2025. Smartrag: Jointly learn rag-related tasks from the environment feedback. ICLR'25 (2025).
- [9] Xinyu Gao, Yun Xiong, Deze Wang, Zhenhan Guan, Zejian Shi, Haofen Wang, and Shanshan Li. 2024. Preference-Guided Refactored Tuning for Retrieval Augmented Code Generation. In ASE'24. 65–77.
- [11] Xuming Hu, Zhaochen Hong, Zhijiang Guo, Lijie Wen, and Philip Yu. 2023. Read it twice: Towards faithfully interpretable fact verification by revisiting evidence. In SIGIR'23. 2319–2323.
- [12] Gautier Izacard and Edouard Grave. 2020. Distilling knowledge from reader to retriever for question answering. ICLR'21 (2020).
- [13] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: few-shot learning with retrieval augmented language models. J. Mach. Learn. Res. 24, 1, Article 251 (Jan. 2023), 43 pages.
- [15] Akriti Jain and Aparna Garimella. 2025. Modeling Contextual Passage Utility for Multihop Question Answering. In IJCNLP-AACL'25. 464–471.
- [16] Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong Park. 2024. Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. In NAACL'24. Association for Computational Linguistics, Mexico City, Mexico, 7036–7050.
- [17] Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation. In EMNLP'23. 7969–7992.
- [18] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516 (2025).
- [19] Seikyung Jung, Jonathan L Herlocker, and Janet Webster. 2007. Click data as implicit relevance feedback in web search. Information Processing & Management 43, 3 (2007), 791–807.
- [20] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In EMNLP'20. 6769–6781.
- [21] Zixuan Ke, Weize Kong, Cheng Li, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky. 2024. Bridging the preference gap between retrievers and llms. In ACL'24. 10438–10451.
- [22] Diane Kelly and Nicholas J Belkin. 2004. Display time as implicit feedback: understanding task effects. In SIGIR'04. 377–384.
- [23] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. NeurIPS'20 33 (2020), 9459–9474.
- [25] Hongyu Lu, Min Zhang, and Shaoping Ma. 2018. Between clicks and satisfaction: Study on multi-phase user preferences and satisfaction for online news reading. In SIGIR'18. 435–444.
- [26] Cheng Luo, Yiqun Liu, Tetsuya Sakai, Ke Zhou, Fan Zhang, Xue Li, and Shaoping Ma. 2017. Does document relevance affect the searcher's perception of time?. In WSDM'17. 141–150.
- [27] Stephen E Robertson. 1977. The probability ranking principle in IR. Journal of Documentation 33, 4 (1977), 294–304.
- [28] Tefko Saracevic. 1975. Relevance: A review of and a framework for the thinking on the notion in information science. JASIST 26, 6 (1975), 321–343.
- [29] Tefko Saracevic. 1996. Relevance reconsidered. In Proceedings of the second conference on conceptions of library and information science (CoLIS 2). 201–218.
- [30] Tefko Saracevic, Paul Kantor, Alice Y Chamis, and Donna Trivison. 1988. A study of information seeking and retrieving. I. Background and methodology. JASIST 39, 3 (1988), 161–176.
- [31] Alfred Schutz and Lester Embree. 2011. Reflections on the Problem of Relevance. Springer.
- [32] Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy. In Findings of the EMNLP 2023. 9248–9274.
- [33] Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2024. REPLUG: Retrieval-Augmented Black-Box Language Models. In NAACL'24. 8371–8384.
- [34] Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu. 2024. Dragin: Dynamic retrieval augmented generation based on the real-time information needs of large language models. In ACL'24. 12991–13013.
- [35] Yue Wang, Dawei Yin, Luo Jie, Pengyuan Wang, Makoto Yamada, Yi Chang, and Qiaozhu Mei. 2016. Beyond ranking: Optimizing whole-page presentation. In WSDM'16. 103–112.
- [36] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP'18. 2369–2380.
- [37] Feng Yu, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. 2016. A dynamic recurrent model for next basket recommendation. In SIGIR'16. 729–732.
- [39] Hengran Zhang, Keping Bi, Jiafeng Guo, and Xueqi Cheng. 2026. An Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy. In Findings of the ACL 2026.
- [41] Hengran Zhang, Keping Bi, Jiafeng Guo, Jiaming Zhang, Shuaiqiang Wang, Dawei Yin, and Xueqi Cheng. 2025. Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation. In SIGIR-AP'25. 22–30.
- [43] Hengran Zhang, Minghao Tang, Keping Bi, Jiafeng Guo, Shihao Liu, Daiting Shi, Dawei Yin, and Xueqi Cheng. 2025. Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation. In EMNLP'25. 1683–1702.
- [44] Hengran Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, and Xueqi Cheng. 2023. From relevance to utility: Evidence retrieval with feedback for fact verification. In Findings of the EMNLP 2023. 6373–6384.
- [45] Hengran Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, and Xueqi Cheng. 2024. Are Large Language Models Good at Utility Judgments?. In SIGIR'24. 1941–1951.
- [46] Xinping Zhao, Dongfang Li, Yan Zhong, Boren Hu, Yibin Chen, Baotian Hu, and Min Zhang. 2024. Seer: Self-aligned evidence extraction for retrieval-augmented generation. EMNLP'24 (2024).
- [48] Xiaofei Zhu, Jiafeng Guo, Xueqi Cheng, and Yanyan Lan. 2012. More than relevance: high utility query recommendation by mining users' search behaviors. In CIKM'12. 1814–1818.