Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models
Pith reviewed 2026-05-08 01:56 UTC · model grok-4.3
The pith
A lightweight router learns to pick query-specific attention heads in LLMs for improved document re-ranking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose RouteHead, a query-dependent head selection method for attention-based re-ranking with LLMs. We learn a lightweight router that maps each query to an optimal head set by representing heads with learnable embeddings and queries with embeddings from the hidden states of the frozen LLM. The router is trained on pseudo labels constructed via offline search together with a sparsity regularizer. Relevance scores are computed by aggregating attention signals only from the selected heads. Experiments on diverse benchmarks and multiple LLM backbones show that the proposed method consistently outperforms strong baselines.
What carries the argument
The lightweight router, which uses learnable head embeddings and query embeddings extracted from LLM hidden states to predict and select per-query optimal head sets from pseudo labels.
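The routing step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the projection matrix `W`, the dot-product scoring, and the fixed top-K budget are all assumptions filled in where the review leaves the forward pass unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)

N_HEADS, D = 32, 64          # hypothetical head count and embedding width
K = 4                        # assumed per-query head budget

head_emb = rng.normal(size=(N_HEADS, D))    # learnable head embeddings
W = rng.normal(size=(D, D)) / np.sqrt(D)    # assumed router projection

def route(query_hidden):
    """Score every head for one query and keep the top-K."""
    q = query_hidden @ W                    # project the frozen-LLM hidden state
    logits = head_emb @ q                   # affinity between query and each head
    selected = np.argsort(logits)[-K:]      # query-specific head subset
    return selected, logits

query_hidden = rng.normal(size=D)           # stand-in for a frozen-LLM hidden state
selected, logits = route(query_hidden)
print(sorted(selected.tolist()))
```

Relevance scores would then be aggregated only over `selected`, leaving the backbone LLM untouched.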
If this is right
- Re-ranking quality rises on diverse benchmarks because only non-redundant, query-relevant heads contribute to the final score.
- The same router architecture works across multiple LLM backbones without retraining the underlying model.
- Sparsity regularization during training limits the number of heads used, reducing potential signal conflicts.
- Attention signals become more fine-grained and effective for zero-shot relevance estimation.
Where Pith is reading between the lines
- The routing mechanism could be adapted to other attention-heavy LLM tasks such as summarization or question answering where head utility also varies.
- If the router generalizes reliably, inference cost could drop by computing attention only for the selected heads rather than the full set.
- This approach resembles dynamic routing in mixture-of-experts models and might benefit from similar load-balancing techniques.
Load-bearing premise
The pseudo labels generated by offline search accurately identify the optimal head sets for each query, and the router trained on those labels generalizes to new queries without inheriting search biases or overfitting.
What would settle it
Run the router on a held-out set of queries for which an exhaustive offline search can be performed to find the true best head combinations; if the router-selected heads produce lower re-ranking quality than those true optima or fail to beat fixed-head baselines, the central claim is falsified.
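At toy scale, where exhaustive search over head subsets is feasible, this falsification test can be sketched directly. Everything here is illustrative (the per-head scores, the binary quality metric, and the stand-in router pick are not from the paper):

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
N_HEADS, K, N_DOCS = 8, 2, 5

# Toy per-head relevance signals: head h assigns each document a score.
head_scores = rng.normal(size=(N_HEADS, N_DOCS))
true_rank = np.argsort(-head_scores[:2].mean(axis=0))  # truth favors heads 0 and 1

def quality(head_set):
    """Toy re-ranking quality: does the aggregated top document match the truth?"""
    agg = head_scores[list(head_set)].mean(axis=0)
    return float(np.argsort(-agg)[0] == true_rank[0])

# Exhaustive search over all K-subsets yields the oracle head set.
oracle = max(combinations(range(N_HEADS), K), key=quality)

router_pick = (0, 1)  # stand-in for the trained router's selection
gap = quality(oracle) - quality(router_pick)
print(oracle, gap)
```

A persistently large `gap` across many held-out queries, or router picks that underperform a fixed-head baseline, would falsify the central claim.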
Original abstract
Large Language Models (LLMs) have recently been explored as fine-grained zero-shot re-rankers by leveraging attention signals to estimate document relevance. However, existing methods either aggregate attention signals across all heads or rely on a statically selected subset identified by heuristic rules. This solution can be suboptimal because the informative heads can vary across queries or domains. Moreover, naively combining multiple heads can degrade performance due to redundancy or conflicting ranking signals. In this paper, we propose a query-dependent head selection method, RouteHead, for attention-based re-ranking with LLMs. Specifically, we learn a lightweight router that can map each query to an optimal head set, and relevance scores are computed by aggregating attention signals only from these heads. Since query-to-head optimal labels are unavailable, we first construct pseudo labels via an offline search. The router represents each head with a learnable embedding and represents each query using an embedding extracted from the hidden states of the frozen LLM. Then it is trained on the pseudo labels with a sparsity regularizer. Experiments on diverse benchmarks and multiple LLM backbones show that the proposed method consistently outperforms strong baselines.
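The abstract leaves the offline search unspecified; one common choice it could plausibly denote is greedy forward selection of heads against a ranking metric. The sketch below is a guess at that procedure, with synthetic per-head scores and relevance labels; none of it is confirmed by the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
N_HEADS, N_DOCS, BUDGET = 16, 10, 3

head_scores = rng.normal(size=(N_HEADS, N_DOCS))  # per-head doc scores for one query
relevance = rng.random(N_DOCS)                    # graded relevance labels

def ndcg_at_k(scores, rel, k=5):
    order = np.argsort(-scores)[:k]
    gains = (2 ** rel[order] - 1) / np.log2(np.arange(2, k + 2))
    ideal = (2 ** np.sort(rel)[::-1][:k] - 1) / np.log2(np.arange(2, k + 2))
    return float(gains.sum() / ideal.sum())

def greedy_search(budget):
    """Greedy forward selection: repeatedly add the head that most improves nDCG."""
    chosen = []
    for _ in range(budget):
        best = max(
            (h for h in range(N_HEADS) if h not in chosen),
            key=lambda h: ndcg_at_k(head_scores[chosen + [h]].mean(axis=0), relevance),
        )
        chosen.append(best)
    return chosen

pseudo_label = greedy_search(BUDGET)   # one pseudo label: a per-query head set
print(pseudo_label)
```

Run per training query, this produces the query-to-head-set pseudo labels the router is then supervised on.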
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce RouteHead, a query-dependent head selection method for attention-based re-ranking using LLMs. It learns a lightweight router that maps each query to an optimal set of attention heads by using query embeddings from frozen LLM hidden states and learnable head embeddings. Pseudo labels for training are generated via an offline search, and the router is trained with a supervised loss plus sparsity regularizer to prevent using redundant heads. Experiments on diverse benchmarks with multiple LLM backbones show consistent outperformance compared to baselines that aggregate all heads or use static subsets.
Significance. If the central assumptions hold, this work could advance attention-based re-ranking by enabling dynamic, query-specific selection of informative heads, addressing the suboptimality of full aggregation or heuristic static selection. The lightweight nature of the router makes it practical for deployment. It highlights the potential of learning to route within LLM internals for IR tasks, and if the pseudo-label approach generalizes well, it could inspire similar techniques for other LLM components.
Major comments (2)
- §3.1 (Pseudo-label construction): The offline search for generating pseudo labels lacks detail on the enumeration strategy, budget, heuristics, or any verification that selected head sets are optimal (e.g., no comparison to exhaustive search on small cases). This is load-bearing because the router is supervised directly on these labels; if the search misses superior combinations or introduces bias, reported gains may reflect label artifacts rather than learned query-dependent routing.
- §4 (Experiments): No ablation is reported comparing the trained router against an oracle head set (optimal heads found by search on held-out queries) or measuring router generalization error on unseen queries. Without this, it is impossible to confirm that outperformance stems from effective routing rather than the router inheriting search biases or overfitting to the pseudo-label distribution.
Minor comments (2)
- Abstract: The claim of 'consistent outperformance' would be stronger with explicit mention of the specific benchmarks, LLM backbones, and baseline implementations used.
- Notation: The router architecture (query embedding extraction and head embedding interaction) would benefit from a figure or explicit pseudocode to clarify the forward pass and sparsity application.
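In the spirit of the pseudocode the second minor comment asks for, one plausible training objective is a multi-label classification loss against the pseudo-label head set plus an L1 sparsity penalty. The BCE form, the sigmoid gating, and the coefficient `lam` are all assumptions, not the paper's stated objective:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def router_loss(logits, pseudo_label, lam=0.01):
    """Multi-label BCE against the pseudo-label head set, plus L1 sparsity.

    `logits` are per-head router scores; `pseudo_label` is a 0/1 vector marking
    the head set found by offline search. This combination is a guess at the
    paper's "supervised loss plus sparsity regularizer".
    """
    p = sigmoid(logits)
    eps = 1e-9
    bce = -(pseudo_label * np.log(p + eps)
            + (1 - pseudo_label) * np.log(1 - p + eps)).mean()
    sparsity = np.abs(p).mean()   # pushes the gate toward few active heads
    return bce + lam * sparsity

logits = np.array([3.0, -2.0, 0.5, -1.0])
label = np.array([1.0, 0.0, 1.0, 0.0])
print(router_loss(logits, label))
```

The sparsity term is what limits how many heads survive selection, which is the lever the review's ledger lists as the method's one free parameter.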
Simulated Author's Rebuttal
We thank the referee for the insightful comments and the recommendation for major revision. We address the major comments point by point below, proposing revisions where appropriate to improve the clarity and rigor of the manuscript.
Point-by-point responses
Referee: §3.1 (Pseudo-label construction): The offline search for generating pseudo labels lacks detail on the enumeration strategy, budget, heuristics, or any verification that selected head sets are optimal (e.g., no comparison to exhaustive search on small cases). This is load-bearing because the router is supervised directly on these labels; if the search misses superior combinations or introduces bias, reported gains may reflect label artifacts rather than learned query-dependent routing.
Authors: We agree that more details are needed on the pseudo-label construction in §3.1. In the revised manuscript, we will provide a detailed description of the enumeration strategy, including the specific search algorithm, budget constraints, and heuristics employed. Additionally, we will include a verification experiment comparing our search results to exhaustive search on small-scale cases to demonstrate that the selected head sets are optimal or near-optimal. This will help confirm that the pseudo labels are reliable and not artifacts of the search process. (revision: yes)
Referee: §4 (Experiments): No ablation is reported comparing the trained router against an oracle head set (optimal heads found by search on held-out queries) or measuring router generalization error on unseen queries. Without this, it is impossible to confirm that outperformance stems from effective routing rather than the router inheriting search biases or overfitting to the pseudo-label distribution.
Authors: We acknowledge the importance of validating the router against an oracle and assessing generalization. We will add an ablation study reporting the router's prediction accuracy on a held-out validation set of queries to measure generalization error. However, performing the full offline search to obtain oracle head sets for all held-out test queries is computationally prohibitive given the scale of our experiments. We will explicitly discuss this limitation in the revised paper and argue that the consistent performance gains over static baselines indicate effective query-dependent routing rather than mere inheritance of search biases. (revision: partial)
- Revision declined: obtaining oracle optimal head sets via search on the full held-out test queries, due to excessive computational requirements.
Circularity Check
No significant circularity detected in derivation chain
Full rationale
The paper constructs pseudo labels for optimal head sets via an independent offline search process, then trains a router on query embeddings (from frozen LLM) and head embeddings using supervised loss plus sparsity regularizer. No equations or steps reduce the router's output or final performance claims to the inputs by construction. No self-citations, uniqueness theorems, or smuggled ansatzes are invoked as load-bearing. Experimental outperformance on benchmarks is presented as empirical validation rather than a mathematical identity. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Sparsity regularizer coefficient
Axioms (1)
- Domain assumption: attention heads produce varying and sometimes conflicting relevance signals across queries.
Invented entities (1)
- Lightweight query-to-head router with learnable head embeddings (no independent evidence).