HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads
Pith reviewed 2026-05-10 06:31 UTC · model grok-4.3
The pith
HeadRank reranks passages by aligning LLM attention heads to preferences in continuous space without decoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HeadRank lifts preference optimization from discrete token space into the continuous attention domain through entropy-regularized head selection, hard adjacent-level preference pairs, and a distribution regularizer that jointly sharpen discriminability in the homogenized middle zone. Depth truncation at the deepest selected layer further reduces inference to O(1) forward passes. Across 14 benchmarks on three Qwen3 scales using only 211 training queries, it consistently outperforms generative and decoding-free baselines with 100% formatting success.
What carries the argument
Entropy-regularized head selection combined with hard adjacent-level preference pairs and a distribution regularizer that aligns attention heads to preferences for listwise reranking in the attention domain.
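The mechanism above can be sketched in miniature. This is a minimal illustrative sketch, not the paper's implementation: the function names, tensor shapes, the logistic pairwise loss, and the sign and weight of the entropy term are all assumptions; the paper only specifies that head selection is entropy-regularized and trained on hard adjacent-level preference pairs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def head_scores(attn, head_logits):
    """Aggregate per-head attention mass into document scores.

    attn:        (H, D) attention mass each head places on D candidate docs
    head_logits: (H,)   learnable selection logits over the H heads
    """
    w = np.exp(head_logits - head_logits.max())
    w /= w.sum()                      # soft head-selection weights
    return w @ attn                   # (D,) aggregated relevance scores

def preference_loss(attn, head_logits, pairs, tau=1.0, lam=0.1):
    """Pairwise logistic loss on adjacent-level pairs plus an entropy term."""
    s = head_scores(attn, head_logits)
    w = np.exp(head_logits - head_logits.max())
    w /= w.sum()
    # pairs: (i, j) where doc i sits exactly one relevance level above doc j,
    # i.e. the "hard adjacent-level" pairs the paper describes
    pref = -np.mean([np.log(sigmoid((s[i] - s[j]) / tau)) for i, j in pairs])
    entropy = -np.sum(w * np.log(w + 1e-12))  # regularizes the head distribution
    return pref + lam * entropy
```

The key point the sketch captures is that optimization happens over continuous attention mass and head-selection weights, with no token ever decoded.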
If this is right
- HeadRank outperforms generative and decoding-free baselines across 14 benchmarks on three model scales from 0.6B to 4B.
- It achieves 100% formatting success on reranking outputs.
- At 4B scale, 57.4% of relevant middle-zone documents reach the top quartile compared with 14.2% for irrelevant ones.
- Performance holds with only 211 training queries.
- Depth truncation reduces inference cost to a constant number of forward passes.
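The depth-truncation idea in the last point can be sketched as follows. The helper name and the representation of layers as plain functions are illustrative assumptions; the paper's claim is only that no layer above the deepest selected head needs to run, so inference cost is a fixed number of truncated forward passes rather than an autoregressive loop.

```python
def truncated_forward(x, layers, selected_layers):
    """Run a layer stack only up to the deepest layer hosting a selected head.

    layers:          ordered list of per-layer forward functions
    selected_layers: indices of layers containing at least one selected head
    """
    depth = max(selected_layers) + 1  # everything above this depth is skipped
    for layer in layers[:depth]:
        x = layer(x)
    return x

# If selected heads live only in layers {3, 7} of a 24-layer stack,
# just 8 of the 24 layers execute per query.
```

Because the scores are read directly from attention at that depth, cost per query is constant in output length, which is the O(1) forward-pass property.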
Where Pith is reading between the lines
- The same head-selection and regularizer pattern could be applied to other long-context tasks that need fine-grained relevance signals, such as multi-document summarization.
- Training with hard adjacent pairs may allow effective alignment from far smaller preference datasets than token-level methods require.
- The O(1) forward-pass property opens the door to real-time reranking pipelines that combine HeadRank with existing retrieval indexes.
Load-bearing premise
That entropy-regularized head selection combined with hard adjacent-level preference pairs and a distribution regularizer can reliably overcome attention homogenization in middle context using only 211 training queries without introducing new biases or overfitting.
What would settle it
Measuring the top-quartile placement rate for relevant versus irrelevant middle-zone documents on a held-out benchmark with longer contexts or unseen model scales; if the gap shrinks to near zero, the claim of sharpened discriminability fails.
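The proposed test reduces to a simple metric. A minimal sketch, assuming only that "top quartile" means the best-ranked quarter of the candidate list; the function names and the boolean-mask interface are illustrative, not from the paper.

```python
import numpy as np

def top_quartile_rate(scores, subset):
    """Fraction of the boolean `subset` of documents ranked in the top quartile."""
    order = np.argsort(-scores)            # best-first ordering
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))  # rank of each doc, 0 = best
    in_top = ranks < len(scores) // 4
    return in_top[subset].mean()

def selectivity_gap(scores, relevant_mid, irrelevant_mid):
    """Gap between relevant and irrelevant middle-zone placement rates,
    e.g. the paper's reported 57.4% vs. 14.2% at the 4B scale."""
    return (top_quartile_rate(scores, relevant_mid)
            - top_quartile_rate(scores, irrelevant_mid))
```

Recomputing this gap on held-out longer-context data or unseen model scales is the proposed falsification: a gap near zero would mean the sharpened discriminability does not transfer.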
Original abstract
Decoding-free reranking methods that read relevance signals directly from LLM attention weights offer significant latency advantages over autoregressive approaches, yet suffer from attention score homogenization: middle-context documents receive near-identical scores, destroying the fine-grained distinctions required for ranking. We propose HeadRank, a framework that lifts preference optimization from discrete token space into the continuous attention domain through entropy-regularized head selection, hard adjacent-level preference pairs, and a distribution regularizer that jointly sharpen discriminability in the homogenized middle zone. Depth truncation at the deepest selected layer further reduces inference to $\mathcal{O}(1)$ forward passes. Across 14 benchmarks on three Qwen3 scales (0.6B--4B) using only 211 training queries, HeadRank consistently outperforms generative and decoding-free baselines with 100\% formatting success. At 4B, 57.4\% of relevant middle-zone documents reach the top quartile versus 14.2\% for irrelevant ones -- a 43-percentage-point selectivity gap that demonstrates the effectiveness of attention-space preference alignment for listwise reranking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HeadRank, a decoding-free reranking framework that transfers preference optimization into the continuous attention domain of LLMs. It employs entropy-regularized head selection, hard adjacent-level preference pairs, and a distribution regularizer to mitigate attention homogenization in middle-context passages, combined with depth truncation for O(1) forward passes. Evaluated on 14 benchmarks across Qwen3 models (0.6B–4B) trained on only 211 queries, it reports consistent gains over generative and decoding-free baselines, 100% formatting success, and a 43-percentage-point selectivity gap (57.4% vs. 14.2%) for relevant vs. irrelevant middle-zone documents at the 4B scale.
Significance. If the results are robust, the work offers a latency-efficient alternative to autoregressive rerankers by directly exploiting attention signals, which could impact large-scale IR systems. The small training budget and perfect formatting success are practical strengths. The reported selectivity gap provides concrete evidence that attention-space alignment can restore discriminability where standard attention fails. These elements, if reproducible, position the method as a useful contribution to decoding-free reranking.
Major comments (2)
- [§4.2] §4.2 (Training protocol): The head selection is optimized on only 211 queries using entropy regularization and hard adjacent preferences. No cross-validation, training-size ablation, or analysis of query diversity is described, which is load-bearing for the claim that the 43pp selectivity gap generalizes across 14 benchmarks and three model scales. Without such checks, the optimization may overfit to artifacts in the preference data rather than learning robust relevance signals.
- [Table 1 and §5.1] Table 1 and §5.1 (Results): The 57.4% vs. 14.2% middle-zone selectivity figures and all benchmark comparisons are reported without error bars, standard deviations, or statistical significance tests. This undermines confidence in the central claim that HeadRank “consistently outperforms” baselines, as variance across runs or seeds cannot be assessed.
Minor comments (2)
- [§3.1] The abstract and §3.1 refer to “listwise reranking” yet the preference pairs are adjacent-level (pairwise). Clarify the distinction and whether the method is strictly pairwise or approximates listwise ranking.
- [Figure 2] Figure 2 (attention heatmaps) would benefit from explicit scale bars and a side-by-side comparison with the baseline attention distribution to visually substantiate the homogenization claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, providing our strongest honest defense while noting where revisions are warranted to improve clarity and robustness.
Point-by-point responses
Referee: [§4.2] §4.2 (Training protocol): The head selection is optimized on only 211 queries using entropy regularization and hard adjacent preferences. No cross-validation, training-size ablation, or analysis of query diversity is described, which is load-bearing for the claim that the 43pp selectivity gap generalizes across 14 benchmarks and three model scales. Without such checks, the optimization may overfit to artifacts in the preference data rather than learning robust relevance signals.
Authors: We appreciate the referee's concern about potential overfitting given the modest training set of 211 queries. The small data regime is intentional and presented as a strength, enabling practical deployment with minimal supervision. Generalization is supported by consistent outperformance across 14 benchmarks that span diverse domains and query types, as well as across three model scales (0.6B–4B). The entropy regularization, hard adjacent-level pairs, and distribution regularizer are explicitly designed to discourage homogenization and promote head selection that captures broad relevance patterns rather than data-specific artifacts. Nevertheless, we agree that explicit checks would strengthen the claims. In the revised manuscript we will add a description of how the 211 queries were selected for diversity, a short discussion of query characteristics, and an explicit statement of the limitation regarding the absence of cross-validation or training-size ablations. revision: partial
Referee: [Table 1 and §5.1] Table 1 and §5.1 (Results): The 57.4% vs. 14.2% middle-zone selectivity figures and all benchmark comparisons are reported without error bars, standard deviations, or statistical significance tests. This undermines confidence in the central claim that HeadRank “consistently outperforms” baselines, as variance across runs or seeds cannot be assessed.
Authors: We agree that the absence of error bars and statistical tests reduces the ability to quantify result stability. Because HeadRank is decoding-free, the inference path is deterministic once the selected heads and truncation depth are fixed; however, the preference optimization stage does contain stochastic elements. To address the referee's point directly, we will rerun the training with multiple random seeds, report standard deviations for the key metrics (including the middle-zone selectivity gap) in the revised Table 1 and §5.1, and add pairwise statistical significance tests (e.g., Wilcoxon signed-rank) against the strongest baselines. These additions will be included in the next version of the manuscript. revision: yes
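The paired testing the authors promise can be illustrated with a simpler relative, the exact sign test on per-query metric differences (the rebuttal proposes Wilcoxon signed-rank, which additionally weights by magnitude; the function name here is illustrative).

```python
from math import comb

def sign_test_p(deltas):
    """Exact two-sided sign test on paired per-query metric differences.

    deltas: per-query (HeadRank metric - baseline metric) values
    """
    n = sum(1 for d in deltas if d != 0)   # ties are dropped
    k = sum(1 for d in deltas if d > 0)    # queries where HeadRank wins
    tail = min(k, n - k)
    # probability of a result at least this one-sided under a fair coin
    p_one = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * p_one)
```

With, say, ten queries that all favor one system, the two-sided p-value is 2/1024 ≈ 0.002; with evenly split wins it is 1.0, which is exactly the variance information the referee asks to see alongside the point estimates.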
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper describes training an entropy-regularized head selection process with hard adjacent-level preference pairs and a distribution regularizer on 211 queries, then evaluates the resulting attention-based reranking on 14 held-out benchmarks across three model scales. Reported metrics such as the 43-percentage-point selectivity gap are measured on separate test data rather than being fitted or renamed versions of the training objective. No equations reduce the central claims to self-definitional inputs, no uniqueness theorems are imported via self-citation, and no ansatzes are smuggled through prior work. The derivation remains self-contained against external benchmarks and externally defined preference pairs.