pith. machine review for the scientific record.

arxiv: 2312.02724 · v1 · submitted 2023-12-05 · 💻 cs.IR

Recognition: no theorem link

RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 23:36 UTC · model grok-4.3

classification 💻 cs.IR
keywords zero-shot reranking · listwise reranking · large language models · information retrieval · open-source models · GPT-4 comparison · BEIR benchmark · TREC Deep Learning

The pith

An open-source LLM for listwise zero-shot reranking matches or surpasses GPT-4 on multiple retrieval benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RankZephyr as an open-source large language model specialized for zero-shot listwise reranking in information retrieval tasks. It shows that strategic training choices allow this model to close the performance gap with GPT-4 and sometimes exceed it on standard datasets including TREC Deep Learning tracks and BEIR collections such as NEWS and COVID. The work further demonstrates that the model remains stable when the initial ordering of documents changes or when the number of documents to rerank varies. By reporting stronger results than GPT-4 on the NovelEval set of post-cutoff queries and passages, the authors address potential data contamination issues while releasing full code for reproduction.

Core claim

RankZephyr is a state-of-the-art open-source LLM for listwise zero-shot reranking that not only bridges the effectiveness gap with GPT-4 but in some cases surpasses it. Comprehensive evaluations across the TREC Deep Learning tracks and BEIR datasets confirm the result, along with resilience to variations in initial document ordering and in the number of documents reranked, and superior performance on the NovelEval test set of post-training-cutoff material.

What carries the argument

RankZephyr, an open-source LLM fine-tuned on listwise reranking prompts, whose strategic training choices are credited with its robustness and high effectiveness in the zero-shot setting.
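
To make the machinery concrete, here is a minimal sketch of RankGPT-style listwise reranking of the kind RankZephyr performs: a window of candidate passages is serialized into a single prompt, the model returns an ordering such as "[3] > [1] > [2]", and that string is parsed back into a permutation of the input list. The prompt wording, the `generate` callable, and the repair rule for omitted passages are illustrative assumptions, not the paper's exact prompt or the released rank_llm code.

```python
# Illustrative sketch only: listwise reranking of one window of passages with
# any prompt-in, text-out LLM callable. Not the paper's exact prompt or code.
import re
from typing import Callable, List


def build_listwise_prompt(query: str, passages: List[str]) -> str:
    """Serialize a query and its numbered candidate passages into one prompt."""
    lines = [
        f"I will provide you with {len(passages)} passages, each indicated by a "
        f"numerical identifier []. Rank the passages based on their relevance "
        f"to the query: {query}",
        "",
    ]
    for i, passage in enumerate(passages, start=1):
        lines.append(f"[{i}] {passage}")
    lines.append("")
    lines.append(f"Search Query: {query}")
    lines.append("Rank the passages above. The output format should be [] > [], "
                 "e.g., [2] > [1]. Only respond with the ranking.")
    return "\n".join(lines)


def parse_ranking(response: str, num_passages: int) -> List[int]:
    """Parse identifiers like '[3] > [1] > [2]' into 0-based indices,
    appending any passages the model omitted in their original order."""
    order: List[int] = []
    for match in re.findall(r"\[(\d+)\]", response):
        idx = int(match) - 1
        if 0 <= idx < num_passages and idx not in order:
            order.append(idx)
    order.extend(i for i in range(num_passages) if i not in order)
    return order


def rerank_window(query: str, passages: List[str],
                  generate: Callable[[str], str]) -> List[str]:
    """Rerank one window of passages; `generate` wraps whatever LLM is used."""
    response = generate(build_listwise_prompt(query, passages))
    return [passages[i] for i in parse_ranking(response, len(passages))]
```

In this family of rerankers, candidate lists longer than one prompt window are typically handled by sliding the window over the list; the sketch covers a single window.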

If this is right

  • Open-source models become viable substitutes for proprietary ones in production reranking pipelines without sacrificing accuracy.
  • Reproducibility improves because full code and model weights are released for the community to inspect and extend.
  • Reranking performance holds steady even when upstream retrievers return documents in arbitrary order or when list lengths vary (a sketch of such a check follows this list).
  • Concerns about data contamination can be directly tested by evaluating on freshly created query-passage pairs.
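
The robustness bullet above can be probed with a small check like the one below: rerank identical candidates from several shuffled starting orders and measure how much the output ranking moves. The `rerank` callable, the use of Kendall's tau, and the trial count are illustrative choices, not the paper's evaluation protocol.

```python
# Hedged sketch of an ordering-robustness check; assumes passage strings are
# unique so they can serve as dictionary keys.
import random
from typing import Callable, Dict, List

from scipy.stats import kendalltau


def ordering_sensitivity(query: str, passages: List[str],
                         rerank: Callable[[str, List[str]], List[str]],
                         trials: int = 5, seed: int = 0) -> float:
    """Mean Kendall tau between the output obtained from the original order
    and outputs obtained from shuffled starting orders; values near 1.0 mean
    the reranker is insensitive to the initial ordering."""
    rng = random.Random(seed)
    reference = rerank(query, list(passages))
    ref_pos: Dict[str, int] = {p: i for i, p in enumerate(reference)}
    taus = []
    for _ in range(trials):
        shuffled = list(passages)
        rng.shuffle(shuffled)
        output = rerank(query, shuffled)
        out_pos = {p: i for i, p in enumerate(output)}
        tau, _ = kendalltau([ref_pos[p] for p in passages],
                            [out_pos[p] for p in passages])
        taus.append(tau)
    return sum(taus) / len(taus)
```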

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fine-tuning recipe may transfer to other listwise tasks such as passage fusion or answer aggregation.
  • Smaller open-source base models could be tested with identical training to measure the minimum scale needed for competitive reranking.
  • Production search systems could adopt RankZephyr-style rerankers to reduce reliance on closed APIs while maintaining or improving result quality.
  • Future benchmarks should routinely include post-cutoff test sets to separate true generalization from memorization effects.

Load-bearing premise

The NovelEval test set contains only queries and passages created after the model's training cutoff with no leakage during fine-tuning or evaluation.
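
A hedged sketch of how that premise could be audited: confirm that every NovelEval item postdates the relevant training cutoff and that no passage shares long verbatim spans with the fine-tuning corpus. The field names, the character n-gram proxy, and the example cutoff date are assumptions made for illustration, not details from the paper or its data files.

```python
# Illustrative contamination audit; item schema and cutoff date are assumed.
from datetime import date
from typing import Dict, Iterable, List, Set, Tuple


def char_ngrams(text: str, n: int = 50) -> Set[str]:
    """Character n-grams as a cheap proxy for verbatim overlap."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def audit_noveleval(items: List[Dict], fine_tuning_corpus: Iterable[str],
                    cutoff: date) -> Tuple[List[Dict], List[Dict]]:
    """Return (items dated on or before the cutoff, items overlapping the corpus)."""
    corpus_grams: Set[str] = set()
    for doc in fine_tuning_corpus:
        corpus_grams |= char_ngrams(doc)

    too_early = [it for it in items if it["created"] <= cutoff]
    overlapping = [it for it in items
                   if char_ngrams(it["passage"]) & corpus_grams]
    return too_early, overlapping


# Toy usage: both lists should come back empty if the premise holds.
items = [{"query": "q1", "passage": "a passage written after the cutoff",
          "created": date(2023, 9, 1)}]
early, leaked = audit_noveleval(items, ["training passage text"], date(2023, 4, 30))
print("dated on/before cutoff:", early, "overlapping corpus:", leaked)
```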

What would settle it

A new test collection of queries and passages created entirely after both models' training cutoffs where RankZephyr no longer matches or exceeds GPT-4 performance.

read the original abstract

In information retrieval, proprietary large language models (LLMs) such as GPT-4 and open-source counterparts such as LLaMA and Vicuna have played a vital role in reranking. However, the gap between open-source and closed models persists, with reliance on proprietary, non-transparent models constraining reproducibility. Addressing this gap, we introduce RankZephyr, a state-of-the-art, open-source LLM for listwise zero-shot reranking. RankZephyr not only bridges the effectiveness gap with GPT-4 but in some cases surpasses the proprietary model. Our comprehensive evaluations across several datasets (TREC Deep Learning Tracks; NEWS and COVID from BEIR) showcase this ability. RankZephyr benefits from strategic training choices and is resilient against variations in initial document ordering and the number of documents reranked. Additionally, our model outperforms GPT-4 on the NovelEval test set, comprising queries and passages past its training period, which addresses concerns about data contamination. To foster further research in this rapidly evolving field, we provide all code necessary to reproduce our results at https://github.com/castorini/rank_llm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RankZephyr, an open-source LLM fine-tuned for zero-shot listwise reranking. It claims to close the effectiveness gap with GPT-4, and in some cases surpass it, on the TREC Deep Learning tracks, the BEIR NEWS/COVID subsets, and especially the NovelEval test set (constructed from post-cutoff queries and passages to mitigate contamination). The work also reports robustness to initial document ordering and reranking list size, with full code released for reproducibility.

Significance. If the core claims hold after addressing verification gaps, this would be a meaningful contribution to IR by delivering a reproducible open-source model that matches or exceeds proprietary LLMs on listwise reranking, supported by robustness experiments and public code. The explicit focus on ordering sensitivity and list-size variation, plus the code release, are concrete strengths that facilitate follow-on work.

major comments (2)
  1. [NovelEval evaluation] NovelEval evaluation (abstract and corresponding results section): The headline claim that RankZephyr surpasses GPT-4 rests primarily on NovelEval results, yet the manuscript supplies no explicit timestamp audit, overlap check against the fine-tuning corpus, or ablation removing borderline items. This verification gap is load-bearing for the zero-shot outperformance interpretation.
  2. [Methods and results] Training and evaluation details (methods/results sections): The paper references 'strategic training choices' and reports benchmark results but omits the precise fine-tuning data mixture, hyperparameter values, exact metric definitions, and statistical significance tests. Without these, the soundness of the GPT-4 surpassing claims remains provisional.

minor comments (2)
  1. [Abstract] Abstract: The statement that RankZephyr 'in some cases surpasses' GPT-4 would be clearer if it named the specific datasets and metrics where this occurs.
  2. [Robustness experiments] Figure clarity: The robustness plots (ordering and list-size sensitivity) would benefit from explicit error bars or statistical annotations to support the 'resilient' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and will revise the manuscript accordingly to improve transparency and reproducibility.

read point-by-point responses
  1. Referee: [NovelEval evaluation] NovelEval evaluation (abstract and corresponding results section): The headline claim that RankZephyr surpasses GPT-4 rests primarily on NovelEval results, yet the manuscript supplies no explicit timestamp audit, overlap check against the fine-tuning corpus, or ablation removing borderline items. This verification gap is load-bearing for the zero-shot outperformance interpretation.

    Authors: We appreciate the referee's emphasis on rigorous verification for NovelEval. The dataset was constructed using queries and passages dated after the training cutoffs of the models evaluated (including GPT-4 and the base models for RankZephyr) specifically to reduce contamination risk, as stated in the manuscript. We agree that an explicit timestamp audit, overlap analysis, and any borderline-item ablation would further strengthen the presentation. In the revised manuscript we will expand the NovelEval section with these details, including the exact cutoff dates used, the overlap verification procedure against the fine-tuning corpus, and results of an ablation that removes any borderline items. These additions will be placed in the evaluation section and will not change the reported numbers. revision: yes

  2. Referee: [Methods and results] Training and evaluation details (methods/results sections): The paper references 'strategic training choices' and reports benchmark results but omits the precise fine-tuning data mixture, hyperparameter values, exact metric definitions, and statistical significance tests. Without these, the soundness of the GPT-4 surpassing claims remains provisional.

    Authors: We agree that the current manuscript is insufficiently explicit on these points. Although the released code repository contains the full training scripts, data files, and evaluation harness, the paper itself should document the precise mixture of fine-tuning data, all hyperparameter values, the exact definitions of the reported metrics (e.g., nDCG@10, MAP), and the statistical significance tests (paired t-tests or Wilcoxon signed-rank tests with p-values) comparing RankZephyr against GPT-4. In the revised version we will add a dedicated subsection in Methods that enumerates the data mixture and hyperparameters, and we will augment the Results tables with significance markers and a short statistical appendix. These changes will make the GPT-4 comparison claims fully verifiable from the text alone. revision: yes
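
As a concrete reading of the tests promised in this response, the fragment below runs a paired t-test and a Wilcoxon signed-rank test over aligned per-query nDCG@10 values for two systems. The score arrays are placeholders; in the revision they would come from per-query evaluation output for RankZephyr and GPT-4 on the same topic set.

```python
# Illustrative per-query significance tests; the scores below are made up.
from scipy.stats import ttest_rel, wilcoxon


def compare_systems(ndcg_system_a, ndcg_system_b):
    """Paired significance tests over per-query metric values.

    Both inputs must be aligned lists of per-query nDCG@10 for the same
    queries. Returns the two p-values."""
    t_stat, t_p = ttest_rel(ndcg_system_a, ndcg_system_b)
    w_stat, w_p = wilcoxon(ndcg_system_a, ndcg_system_b)
    return t_p, w_p


# Toy example with made-up per-query scores for three queries.
rankzephyr = [0.72, 0.65, 0.81]
gpt4 = [0.70, 0.66, 0.78]
print(compare_systems(rankzephyr, gpt4))
```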

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmarks and released code

full rationale

The paper trains RankZephyr on listwise reranking data and reports empirical performance on independent public benchmarks (TREC DL, BEIR NEWS/COVID, NovelEval). No equations, parameters, or derivations are shown to reduce by construction to the inputs; the central effectiveness claims are not self-definitional, fitted predictions, or dependent on self-citation chains. NovelEval is presented as an external post-cutoff set, with code release enabling external verification. This is a standard empirical ML evaluation without load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim depends on the effectiveness of undisclosed fine-tuning choices and the assumption that NovelEval avoids contamination; no new mathematical axioms or invented entities are introduced.

free parameters (1)
  • fine-tuning hyperparameters and data mixture
    Strategic training choices are cited as the source of robustness but are not enumerated in the abstract.

pith-pipeline@v0.9.0 · 5504 in / 1078 out tokens · 46457 ms · 2026-05-15T23:36:55.601641+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FollowTable: A Benchmark for Instruction-Following Table Retrieval

    cs.IR 2026-05 unverdicted novelty 8.0

    FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond ...

  2. F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking

    cs.LG 2026-05 unverdicted novelty 7.0

    F-GRPO factorizes group-relative policy optimization into generation and ranking phases within one autoregressive sequence, using order-invariant coverage and position-aware utility rewards to improve top-ranked perfo...

  3. State-Centric Decision Process

    cs.AI 2026-05 unverdicted novelty 7.0

    SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

  4. Very Efficient Listwise Multimodal Reranking for Long Documents

    cs.IR 2026-05 unverdicted novelty 7.0

    ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

  5. The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

    cs.LG 2026-05 unverdicted novelty 7.0

    On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

  6. Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models

    cs.IR 2026-05 unverdicted novelty 7.0

    CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking model...

  7. Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval

    cs.IR 2026-04 accept novelty 7.0

    Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar scoring.

  8. ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression

    cs.IR 2026-04 conditional novelty 7.0

    ResRank unifies retrieval and listwise reranking by compressing passages to one token each, using residual connections and cosine-similarity scoring, achieving competitive effectiveness on TREC DL and BEIR benchmarks ...

  9. Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.

  10. Entities as Retrieval Signals: A Systematic Study of Coverage, Supervision, and Evaluation in Entity-Oriented Ranking

    cs.IR 2026-04 conditional novelty 6.0

    Entity signals cover only 19.7% of relevant documents on Robust04 and no configuration among 443 systems improves MAP by more than 0.05 in open-world evaluation, despite gains when entities are pre-restricted.

  11. Where Relevance Emerges: A Layer-Wise Study of Internal Attention for Zero-Shot Re-Ranking

    cs.IR 2026-02 unverdicted novelty 6.0

    Internal attention in LLMs shows a bell-curve relevance distribution across layers, enabling Selective-ICR that cuts inference latency 30-50% and lets an 8B zero-shot model match 14B RL re-rankers on BRIGHT.

  12. MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval

    cs.CL 2026-05 unverdicted novelty 5.0

    MemReranker applies multi-stage distillation to Qwen3-Reranker to produce reasoning-aware rerankers that outperform baselines on memory tasks with temporal and causal constraints.

  13. Efficient Listwise Reranking with Compressed Document Representations

    cs.IR 2026-04 unverdicted novelty 5.0

    RRK compresses documents to multi-token embeddings for efficient listwise reranking, enabling an 8B model to achieve 3x-18x speedups over smaller models with comparable or better effectiveness.

  14. Rethinking the Necessity of Adaptive Retrieval-Augmented Generation through the Lens of Adaptive Listwise Ranking

    cs.IR 2026-04 unverdicted novelty 5.0

    AdaRankLLM shows adaptive listwise reranking outperforms fixed-depth retrieval for most LLMs by acting as a noise filter for weak models and an efficiency optimizer for strong ones, with lower context use.

  15. Dynamic Ranked List Truncation for Reranking Pipelines via LLM-generated Reference-Documents

    cs.IR 2026-04 unverdicted novelty 5.0

    LLM-generated reference documents enable dynamic ranked list truncation and adaptive batching for listwise reranking, outperforming prior RLT methods and accelerating processing by up to 66% on TREC benchmarks.

  16. MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval

    cs.CL 2026-05 unverdicted novelty 4.0

    MemReranker applies multi-teacher pairwise distillation, BCE pointwise training, and InfoNCE contrastive learning on mixed general and memory-specific dialogue data to produce efficient rerankers that improve calibrat...

  17. Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data

    cs.CL 2026-04 conditional novelty 4.0

    Mira-Embeddings-V1 adapts embeddings for recruitment reranking by synthesizing positive and hard-negative samples with LLMs, then applies JD-JD contrastive and JD-CV triplet training plus a BoundaryHead MLP, lifting R...

  18. A Reproducibility Study of Metacognitive Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 3.0

    MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.

  19. Reproducing Adaptive Reranking for Reasoning-Intensive IR

    cs.IR 2026-04 unverdicted novelty 2.0

    Reproducing GAR on BRIGHT shows it boosts reasoning-intensive retrieval effectiveness with low overhead when the reranker's signal quality is strong.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 18 Pith papers · 4 internal anchors

  1. [1]

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv:1611.09268v3

  2. [2]

    Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. 2022. InPars: Unsupervised dataset generation for information retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022), pages 2387--2392, Madrid, Spain

  3. [3]

    Leonid Boytsov, Preksha Patel, Vivek Sourabh, Riddhi Nisar, Sayani Kundu, Ramya Ramanathan, and Eric Nyberg. 2023. InPars-Light: Cost-effective unsupervised training of efficient rankers. arXiv:2301.02998

  4. [4]

    B. Barla Cambazoglu, Hugo Zaragoza, Olivier Chapelle, Jiang Chen, Ciya Liao, Zhaohui Zheng, and Jon Degenhardt. 2010. Early exit optimizations for additive machine learned ranking systems. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), pages 411--420, New York, New York

  5. [5]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. 2020. Overview of the TREC 2020 deep learning track. In Proceedings of the Twenty-Ninth Text REtrieval Conference Proceedings (TREC 2020), Gaithersburg, Maryland

  6. [6]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Jimmy Lin. 2021. Overview of the TREC 2021 deep learning track. In Proceedings of the Thirtieth Text REtrieval Conference (TREC 2021)

  7. [7]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, and Ian Soboroff. 2022. Overview of the TREC 2022 deep learning track. In Proceedings of the Thirty-First Text REtrieval Conference (TREC 2022), Gaithersburg, Maryland

  8. [8]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2019. Overview of the TREC 2019 deep learning track. In Proceedings of the Twenty-Eighth Text REtrieval Conference Proceedings (TREC 2019), Gaithersburg, Maryland

  9. [9]

    Zhuyun Dai, Vincent Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B. Hall, and Ming-Wei Chang. 2022. Promptagator: Few-shot dense retrieval from 8 examples. arXiv:2209.11755

  10. [10]

    Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2022. From distillation to hard negative sampling: Making sparse neural IR models more effective. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022), pages 2353--2359, Madrid, Spain

  11. [11]

    Luyu Gao, Zhuyun Dai, and Jamie Callan. 2021. Rethink training of BERT rerankers in multi-stage retrieval pipeline. In Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021)

  12. [12]

    Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2023. Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1762--1777, Toronto, Canada

  13. [13]

    Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2023. NEFTune: Noisy embeddings improve instruction finetuning. arXiv:2310.05914

  14. [14]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.06825

  15. [15]

    Carlos Lassance, Ronak Pradeep, and Jimmy Lin. 2023. Naverloo @ TREC deep learning and NeuCLIR 2023: As easy as zero, one, two, three --- cascading dual encoders, mono, duo, and listo for ad-hoc retrieval. In Proceedings of the Thirty-Second Text REtrieval Conference (TREC 2023)

  16. [16]

    Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021a. Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pa...

  17. [17]

    Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2021b. Pretrained Transformers for Text Ranking: BERT and Beyond. Morgan & Claypool Publishers

  18. [18]

    Jimmy Lin, Ronak Pradeep, Tommaso Teofili, and Jasper Xian. 2023. Vector search with OpenAI embeddings: Lucene is all you need. arXiv:2308.14963

  19. [19]

    Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. 2023a. Fine-tuning LLaMA for multi-stage text retrieval. arXiv:2310.08319

  20. [20]

    Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. 2023b. Zero-shot listwise document reranking with a large language model. arXiv:2305.02156

  21. [21]

    Irina Matveeva, Chris Burges, Timo Burkard, Andy Laucius, and Leon Wong. 2006. High accuracy retrieval with multiple nested ranker. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), pages 437--444, Seattle, Washington

  22. [22]

    Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov, Joanne Jang, P...

  23. [23]

    Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with BERT. arXiv:1901.04085

  24. [24]

    Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pretrained sequence-to-sequence model. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 708--718

  25. [25]

    Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-stage document ranking with BERT. arXiv:1910.14424

  26. [26]

    Cicero Nogueira dos Santos, Xiaofei Ma, Ramesh Nallapati, Zhiheng Huang, and Bing Xiang. 2020. Beyond [CLS] through ranking by generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1722--1727, Online

  27. [27]

    Ronak Pradeep, Kai Hui, Jai Gupta, Adam D. Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, and Vinh Q. Tran. 2023a. How does generative retrieval scale to millions of passages? arXiv:2305.11841

  28. [28]

    Ronak Pradeep, Yilin Li, Yuetong Wang, and Jimmy Lin. 2022a. Neural query synthesis and domain-specific ranking templates for multi-stage clinical trial matching. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022), pages 2325--2330, Madrid, Spain

  29. [29]

    Ronak Pradeep, Yuqi Liu, Xinyu Zhang, Yilin Li, Andrew Yates, and Jimmy Lin. 2022b. Squeezing water from a stone: A bag of tricks for further improving cross-encoder effectiveness for reranking. In Proceedings of the 44th European Conference on Information Retrieval (ECIR 2022), Part I, pages 655--670, Stavanger, Norway

  30. [30]

    Ronak Pradeep, Xueguang Ma, Rodrigo Nogueira, and Jimmy Lin. 2021a. Vera: Prediction techniques for reducing harmful misinformation in consumer health search. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2066--2070

  31. [31]

    Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. 2021b. The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models. arXiv:2101.05667

  32. [32]

    Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. 2023b. RankVicuna: Zero-shot listwise document reranking with open-source large language models. arXiv:2309.15088

  33. [33]

    Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. 2023. Large language models are effective text rankers with pairwise ranking prompting. arXiv:2306.17563

  34. [34]

    Stephen E. Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333--389

  35. [35]

    Weiwei Sun, Lingyong Yan, Xinyu Ma, Pengjie Ren, Dawei Yin, and Zhaochun Ren. 2023. Is ChatGPT good at search? Investigating large language models as re-ranking agent. arXiv:2304.09542

  36. [36]

    Raphael Tang, Xinyu Zhang, Xueguang Ma, Jimmy Lin, and Ferhan Ture. 2023. Found in the middle: Permutation self-consistency improves listwise ranking in large language models. arXiv:2310.07712

  37. [37]

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

  38. [38]

    Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. 2023. Zephyr: Direct distillation of LM alignment. arXiv:2310.16944

  39. [39]

    Lidan Wang, Jimmy Lin, and Donald Metzler. 2011. A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), pages 105--114, Beijing, China

  40. [40]

    Honglei Zhuang, Zhen Qin, Kai Hui, Junru Wu, Le Yan, Xuanhui Wang, and Michael Bendersky. 2023a. Beyond yes and no: Improving zero-shot LLM rankers via scoring fine-grained relevance labels. arXiv:2310.14122

  41. [41]

    Honglei Zhuang, Zhen Qin, Rolf Jagerman, Kai Hui, Ji Ma, Jing Lu, Jianmo Ni, Xuanhui Wang, and Michael Bendersky. 2022. RankT5: Fine-tuning T5 for text ranking with ranking losses. arXiv:2210.10634

  42. [42]

    Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, and Guido Zuccon. 2023b. A setwise approach for effective and highly efficient zero-shot ranking with large language models. arXiv:2310.09497