pith. sign in

arxiv: 2502.00709 · v5 · submitted 2025-02-02 · 💻 cs.IR

RankFlow: A Multi-Role Collaborative Reranking Workflow Utilizing Large Language Models

Pith reviewed 2026-05-23 04:33 UTC · model grok-4.3

classification 💻 cs.IR
keywords RankFlowrerankinglarge language modelsinformation retrievalmulti-role workflowTREC-DLBEIRpassage ranking
0
0 comments X

The pith

RankFlow assigns LLMs four specialized roles in sequence to improve passage reranking for queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RankFlow as a workflow that breaks reranking into four LLM roles: rewriting the query for clarity, generating a pseudo answer from knowledge, summarizing passages to essentials, and finally ranking based on all prior outputs. This division lets each step focus on one aspect of understanding relevance between query and passage. If the approach holds, it produces more accurate ordering of candidate passages than single-prompt or single-model rerankers. Experiments on TREC-DL, BEIR, and NovelEval show gains over prior leading methods, with additional analysis of each role's contribution.

Core claim

RankFlow enlists LLMs to fulfill four distinct roles: the query Rewriter, the pseudo Answerer, the passage Summarizer, and the Reranker. This orchestrated approach enables RankFlow to accurately interpret queries, draw upon LLMs' extensive pre-existing knowledge, distill passages into concise versions, and assess passages in a comprehensive manner, resulting in notably better reranking results on TREC-DL, BEIR, and NovelEval.

What carries the argument

The RankFlow workflow that sequences four LLM roles (query Rewriter, pseudo Answerer, passage Summarizer, Reranker) to produce the final ranked list.

If this is right

  • Query rewriting produces clearer inputs that improve downstream relevance judgments.
  • Pseudo answering injects the LLM's stored knowledge into the ranking decision.
  • Passage summarization reduces noise so the reranker focuses on core content.
  • The final reranker integrates signals from the three prior stages for more complete assessment.
  • The combined workflow exceeds prior top methods on TREC-DL, BEIR, and NovelEval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Role separation makes it possible to measure and improve each stage independently without retraining the entire system.
  • The same division of labor could be applied to other retrieval stages such as initial candidate generation.

Load-bearing premise

Large language models can reliably carry out the four roles in order without errors from earlier stages compounding and degrading later ones.

What would settle it

Run the workflow but replace the output of the Rewriter or Summarizer with deliberately incorrect or random text and measure whether final NDCG or recall on TREC-DL drops sharply compared with the reported results.

Figures

Figures reproduced from arXiv: 2502.00709 by Anxiang Zhang, Caiwen Ding, Can Jin, Dimitris N. Metaxas, Hongwu Peng, Jiahui Zhao, Kai Zhong, Kuangzheng Li, Nuo Chen, Shuya Feng, Xi Xie.

Figure 1
Figure 1. Figure 1: Overview of RankFlow . RankFlow is composed of four well-defined expert roles: Rewriter, Answerer, Summarizer, and Reranker, each designed to address specific issues in passage reranking. These roles work sequentially to handle the ranking task [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
read the original abstract

In an Information Retrieval (IR) system, reranking plays a critical role by sorting candidate passages according to their relevance to a specific query. This process demands a nuanced understanding of the variations among passages linked to the query. In this work, we introduce RankFlow, a multi-role reranking workflow that leverages the capabilities of Large Language Models (LLMs) and role specializations to improve reranking performance. RankFlow enlists LLMs to fulfill four distinct roles: the query Rewriter, the pseudo Answerer, the passage Summarizer, and the Reranker. This orchestrated approach enables RankFlow to: (1) accurately interpret queries, (2) draw upon LLMs' extensive pre-existing knowledge, (3) distill passages into concise versions, and (4) assess passages in a comprehensive manner, resulting in notably better reranking results. Our experimental results reveal that RankFlow outperforms existing leading approaches on widely recognized IR benchmarks, such as TREC-DL, BEIR, and NovelEval. Additionally, we investigate the individual contributions of each role in RankFlow.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RankFlow, a multi-role collaborative reranking workflow for information retrieval systems. LLMs are assigned four specialized roles—query Rewriter, pseudo Answerer, passage Summarizer, and Reranker—to interpret queries, leverage pre-trained knowledge, condense passages, and perform final relevance assessment. The authors claim this orchestrated workflow outperforms leading reranking approaches on TREC-DL, BEIR, and NovelEval benchmarks and provide an analysis of each role's individual contribution.

Significance. If the empirical results are robust, the work would add to LLM-based IR research by demonstrating the potential benefits of explicit role specialization and multi-stage orchestration in reranking pipelines. The explicit investigation of per-role contributions is a constructive element that could guide subsequent workflow designs in the field.

major comments (2)
  1. [experimental evaluation and role-contribution analysis] The central outperformance claim on TREC-DL, BEIR, and NovelEval rests on the premise that the four-role workflow yields net gains without substantial error propagation from intermediate LLM outputs (e.g., inaccurate rewrites, hallucinated pseudo-answers, or lossy summaries). The manuscript states that individual role contributions were investigated, yet supplies no quantitative metrics on role-level fidelity, inter-role consistency, or controlled ablations that inject errors at specific stages to isolate workflow effects from base-LLM strength. This verification is load-bearing for attributing gains to the orchestrated structure rather than the underlying model.
  2. [abstract and results] The abstract asserts that RankFlow 'outperforms existing leading approaches' on the cited benchmarks, but the provided description contains no numerical results, tables of metrics (e.g., nDCG@10, MRR), error bars, or statistical tests. The results section must be examined to confirm that reported improvements are statistically significant and not attributable to prompt sensitivity or model choice alone.
minor comments (2)
  1. [methodology] Provide the exact prompt templates used for each of the four roles so that the workflow is fully reproducible.
  2. [experimental setup] Include version numbers, query/passages splits, and any preprocessing steps for the TREC-DL, BEIR, and NovelEval collections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications on our experimental design and results presentation.

read point-by-point responses
  1. Referee: The central outperformance claim on TREC-DL, BEIR, and NovelEval rests on the premise that the four-role workflow yields net gains without substantial error propagation from intermediate LLM outputs. The manuscript states that individual role contributions were investigated, yet supplies no quantitative metrics on role-level fidelity, inter-role consistency, or controlled ablations that inject errors at specific stages to isolate workflow effects from base-LLM strength.

    Authors: We agree that demonstrating the workflow's contribution beyond base LLM capabilities is important. Our manuscript includes ablation studies that remove or alter individual roles (Rewriter, Answerer, Summarizer, Reranker) and report resulting performance drops on the benchmarks, supporting the value of the orchestrated structure. However, we did not provide explicit quantitative metrics on role-level fidelity (e.g., rewrite accuracy) or controlled experiments injecting errors into intermediate stages. We will add these analyses, including fidelity measurements and error-injection ablations, in the revised manuscript to better isolate workflow effects. revision: yes

  2. Referee: The abstract asserts that RankFlow 'outperforms existing leading approaches' on the cited benchmarks, but the provided description contains no numerical results, tables of metrics (e.g., nDCG@10, MRR), error bars, or statistical tests. The results section must be examined to confirm that reported improvements are statistically significant and not attributable to prompt sensitivity or model choice alone.

    Authors: The abstract serves as a concise summary and conventionally omits specific numerical values. The Experiments section presents full results with tables reporting nDCG@10, MRR, and other metrics across TREC-DL, BEIR, and NovelEval, including comparisons to leading baselines. Statistical significance tests are included for key improvements. Experiments use fixed prompts and multiple model configurations to address sensitivity concerns; we can expand discussion of these controls if needed but believe the current presentation is sufficient. revision: no

Circularity Check

0 steps flagged

No circularity; empirical workflow with benchmark validation

full rationale

The paper introduces RankFlow as a multi-role LLM workflow (Rewriter, pseudo Answerer, Summarizer, Reranker) and reports experimental outperformance on TREC-DL, BEIR, and NovelEval. No equations, derivations, or predictions appear that reduce claimed gains to fitted parameters, self-definitions, or self-citation chains by construction. Role contributions are investigated empirically rather than asserted via uniqueness theorems or ansatzes imported from prior author work. The central claim rests on external benchmark results, making the derivation self-contained against the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, parameters, or explicit assumptions; the workflow implicitly assumes LLMs possess sufficient knowledge and instruction-following ability to execute each role accurately.

pith-pipeline@v0.9.0 · 5757 in / 1157 out tokens · 31025 ms · 2026-05-23T04:33:42.435587+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 8 internal anchors

  1. [1]

    Nasreen Abdul-Jaleel, James Allan, W Bruce Croft, Fernando Diaz, Leah Larkey, Xiaoyan Li, Mark D Smucker, and Courtney Wade. 2004. UMass at TREC 2004: Novelty and HARD. Computer Science Department Faculty Publication Series (2004), 189

  2. [2]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  3. [3]

    AI@Meta. 2024. Llama 3 Model Card. (2024). https://github.com/meta-llama/ llama3/blob/main/MODEL_CARD.md

  4. [4]

    Marwah Alaofi, Luke Gallagher, Mark Sanderson, Falk Scholer, and Paul Thomas

  5. [5]

    In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

    Can generative llms create query variants for test collections? an ex- ploratory study. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval . 1869–1873

  6. [6]

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al . 2023. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508 (2023)

  7. [7]

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 (2016)

  8. [8]

    R Meredith Belbin and Victoria Brown. 2022. Team roles at work. Routledge

  9. [9]

    Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. 2022. Inpars: Data augmentation for information retrieval using large language models. arXiv preprint arXiv:2202.05144 (2022)

  10. [10]

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Ruther- ford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bog- dan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International conference on machine learning . PMLR, 2206–2240

  11. [11]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901

  12. [12]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M Voorhees. 2020. Overview of the TREC 2019 deep learning track. arXiv preprint arXiv:2003.07820 (2020)

  13. [13]

    Zhuyun Dai, Vincent Y Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith Hall, and Ming-Wei Chang. 2022. Promptagator: Few-shot Dense Retrieval From 8 Examples. In The Eleventh International Conference on Learning Representations

  14. [14]

    Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z Wang. 2008. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (Csur) 40, 2 (2008), 1–60

  15. [15]

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch

  16. [16]

    Improving Factuality and Reasoning in Language Models through Multiagent Debate

    Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325 (2023)

  17. [17]

    Wenqi Fan, Zihuai Zhao, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Jiliang Tang, and Qing Li. 2023. Recommender systems in the era of large language models (llms). arXiv preprint arXiv:2307.02046 (2023)

  18. [18]

    Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2023. Precise Zero-Shot Dense Retrieval without Relevance Labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . 1762–1777

  19. [19]

    Jiashu He, Charilaos I Kanatsoulis, and Alejandro Ribeiro. 2023. T-GAE: Trans- ferable Graph Autoencoder for Network Alignment. arXiv e-prints (2023), arXiv– 2310

  20. [20]

    Jiashu He, Mingyu Derek Ma, Jinxuan Fan, Dan Roth, Wei Wang, and Alejandro Ribeiro. 2024. GIVE: Structured Reasoning with Knowledge Graph Inspired Veracity Extrapolation. arXiv preprint arXiv:2410.08475 (2024)

  21. [21]

    Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al . 2023. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352 (2023)

  22. [22]

    Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. 2024. Large language models are zero-shot rankers for recommender systems. In European Conference on Information Retrieval. Springer, 364–381

  23. [23]

    Jeff Huang and Efthimis N Efthimiadis. 2009. Analyzing and evaluating query reformulation strategies in web search logs. In Proceedings of the 18th ACM conference on Information and knowledge management . 77–86

  24. [24]

    Rolf Jagerman, Honglei Zhuang, Zhen Qin, Xuanhui Wang, and Michael Bender- sky. 2023. Query expansion by prompting large language models. arXiv preprint arXiv:2305.03653 (2023)

  25. [25]

    Can Jin, Tong Che, Hongwu Peng, Yiyuan Li, Dimitris Metaxas, and Marco Pavone. 2024. Learning from Teaching Regularization: Generaliz- able Correlations Should be Easy to Imitate. In Advances in Neural Infor- mation Processing Systems , A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, In...

  26. [26]

    Can Jin, Hongwu Peng, Qixin Zhang, Yujin Tang, Dimitris N Metaxas, and Tong Che. 2025. Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning. arXiv preprint arXiv:2504.09772 (2025)

  27. [27]

    Can Jin, Hongwu Peng, Shiyu Zhao, Zhenting Wang, Wujiang Xu, Ligong Han, Jiahui Zhao, Kai Zhong, Sanguthevar Rajasekaran, and Dimitris N Metaxas

  28. [28]

    arXiv preprint arXiv:2406.14449 (2024)

    APEER: Automatic Prompt Engineering Enhances Large Language Model Reranking. arXiv preprint arXiv:2406.14449 (2024)

  29. [29]

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems 35 (2022), 22199–22213

  30. [30]

    Mosh Levy, Alon Jacoby, and Yoav Goldberg. 2024. Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models. arXiv preprint arXiv:2402.14848 (2024)

  31. [31]

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics . 7871–7880

  32. [32]

    Zhengyang Li, Qijin Ji, Xinghong Ling, and Quan Liu. 2025. A Comprehensive Review of Multi-Agent Reinforcement Learning in Video Games. Authorea Preprints (2025)

  33. [33]

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michi- hiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al

  34. [34]

    Holistic Evaluation of Language Models

    Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022)

  35. [35]

    Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: A Python toolkit for reproducible infor- mation retrieval research with sparse and dense representations. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2356–2362

  36. [36]

    Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. 2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems 35 (2022), 1950–1965

  37. [37]

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. Comput. Surveys 55, 9 (2023), 1–35

  38. [38]

    Shicheng Liu and Minghui Zhu. 2022. Distributed inverse constrained reinforce- ment learning for multi-agent systems. Advances in Neural Information Processing Systems 35 (2022), 33444–33456

  39. [39]

    Shicheng Liu and Minghui Zhu. 2024. Learning Multi-agent Behaviors from Distributed and Streaming Demonstrations. Advances in Neural Information Processing Systems 36 (2024)

  40. [40]

    Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. 2023. Query Rewriting in Retrieval-Augmented Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 5303–5315

  41. [41]

    Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. 2023. Zero- shot listwise document reranking with a large language model. arXiv preprint arXiv:2305.02156 (2023)

  42. [42]

    Kelong Mao, Zhicheng Dou, Fengran Mo, Jiewen Hou, Haonan Chen, and Hongjin Qian. 2023. Large Language Models Know Your Contextual Search Intent: A Prompting Framework for Conversational Search. In Findings of the Association for Computational Linguistics: EMNLP 2023 . 1211–1225

  43. [43]

    Donald Metzler and W Bruce Croft. 2005. A markov random field model for term dependencies. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval . 472–479

  44. [44]

    Donald Metzler and W Bruce Croft. 2007. Latent concept expansion using markov random fields. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval . 311–318

  45. [45]

    Zhijie Nie, Richong Zhang, Zhongyuan Wang, and Xudong Liu. 2024. Code-style in-context learning for knowledge-based question answering. In Proceedings of the AAAI Conference on Artificial Intelligence , Vol. 38. 18833–18841

  46. [46]

    Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Docu- ment Ranking with a Pretrained Sequence-to-Sequence Model. In Findings of the Association for Computational Linguistics: EMNLP 2020 . 708–718

  47. [47]

    Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-stage document ranking with BERT. arXiv preprint arXiv:1910.14424 (2019)

  48. [48]

    Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document expansion by query prediction. arXiv preprint arXiv:1904.08375 (2019)

  49. [49]

    OpenAI. 2022. Introducing ChatGPT. https://openai.com/blog/chatgpt. WWW Companion ’25, April 28-May 2, 2025, Sydney, NSW, Australia Can Jin et al

  50. [50]

    Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. 2023. RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! arXiv preprint arXiv:2312.02724 (2023)

  51. [51]

    Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, et al. 2023. Large language models are effective text rankers with pairwise ranking prompting.arXiv preprint arXiv:2306.17563 (2023)

  52. [52]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9

  53. [53]

    Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. Nist Special Publication Sp 109 (1995), 109

  54. [54]

    Devendra Sachan, Mike Lewis, Mandar Joshi, Armen Aghajanyan, Wen-tau Yih, Joelle Pineau, and Luke Zettlemoyer. 2022. Improving Passage Retrieval with Zero-Shot Question Generation. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing . 3781–3797

  55. [55]

    Weiwei Sun, Zheng Chen, Xinyu Ma, Lingyong Yan, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. 2023. Instruction distilla- tion makes large language models efficient zero-shot rankers. arXiv preprint arXiv:2311.01555 (2023)

  56. [56]

    Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. 2023. Is ChatGPT Good at Search? Investi- gating Large Language Models as Re-Ranking Agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing . 14918–14937

  57. [57]

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

  58. [58]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  59. [59]

    Liang Wang, Nan Yang, and Furu Wei. 2023. Query2doc: Query Expansion with Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing . 9414–9423

  60. [60]

    Zhongyuan Wang, Richong Zhang, Zhijie Nie, and Jaein Kim. 2024. Tool-assisted agent on sql inspection and refinement in real-world scenarios. arXiv preprint arXiv:2408.16991 (2024)

  61. [61]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837

  62. [62]

    Yijie Weng and Jianhao Wu. 2024. Leveraging Artificial Intelligence to Enhance Data Security and Combat Cyber Attacks.Journal of Artificial Intelligence General science (JAIGS) ISSN: 3006-4023 5, 1 (2024), 392–399. doi:10.60087/jaigs.v5i1.211

  63. [63]

    Yijie Weng, Jianhao Wu, Tara Kelly, and William Johnson. 2024. Comprehensive Overview of Artificial Intelligence Applications in Modern Industries. arXiv preprint arXiv:2409.13059 (2024). doi:10.48550/arXiv.2409.13059

  64. [64]

    Michael Wooldridge and Nicholas R Jennings. 1998. Pitfalls of agent-oriented development. In Proceedings of the second international conference on Autonomous agents. 385–391

  65. [65]

    Huiwen Wu, Xiaohan Li, Deyi Zhang, Xiaogang Xu, Jiafei Wu, Puning Zhao, and Zhe Liu. 2024. CG-FedLLM: How to Compress Gradients in Federated Fune- tuning for Large Language Models. arXiv preprint arXiv:2405.13746 (2024)

  66. [66]

    Likang Wu, Zhi Li, Hongke Zhao, Zhenya Huang, Yongqiang Han, Junji Jiang, and Enhong Chen. 2024. Supporting your idea reasonably: A knowledge-aware topic reasoning strategy for citation recommendation. IEEE Transactions on Knowledge and Data Engineering (2024)

  67. [67]

    Yunjia Xi, Weiwen Liu, Jianghao Lin, Jieming Zhu, Bo Chen, Ruiming Tang, Weinan Zhang, Rui Zhang, and Yong Yu. 2023. Towards open-world recommen- dation with knowledge augmentation from large language models.arXiv preprint arXiv:2306.10933 (2023)

  68. [68]

    Benfeng Xu, An Yang, Junyang Lin, Quan Wang, Chang Zhou, Yongdong Zhang, and Zhendong Mao. 2023. Expertprompting: Instructing large language models to be distinguished experts. arXiv preprint arXiv:2305.14688 (2023)

  69. [69]

    Hao Xu, Xiangru Jian, Xinjian Zhao, Wei Pang, Chao Zhang, Suyuchen Wang, Qixin Zhang, Joao Monteiro, Qiuzhuang Sun, and Tianshu Yu. 2025. GraphOmni: A Comprehensive and Extendable Benchmark Framework for Large Language Models on Graph-theoretic Tasks. arXiv preprint arXiv:2504.12764 (2025)

  70. [70]

    Chang Yu, Yongshun Xu, Jin Cao, Ye Zhang, Yixin Jin, and Mengran Zhu. 2024. Credit card fraud detection using advanced transformer model. In 2024 IEEE International Conference on Metaverse Computing, Networking, and Applications (MetaCom). IEEE, 343–350

  71. [71]

    Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, S Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. 2023. Generate rather than Retrieve: Large Language Models are Strong Context Generators. InInternational Conference on Learning Representations

  72. [72]

    Chengxiang Zhai and John Lafferty. 2001. Model-based feedback in the lan- guage modeling approach to information retrieval. In Proceedings of the tenth international conference on Information and knowledge management . 403–410

  73. [73]

    Qixin Zhang, Zengde Deng, Zaiyi Chen, Haoyuan Hu, and Yu Yang. 2022. Stochas- tic continuous submodular maximization: Boosting via non-oblivious function. In International Conference on Machine Learning . PMLR, 26116–26134

  74. [74]

    Qixin Zhang, Zengde Deng, Zaiyi Chen, Kuangqi Zhou, Haoyuan Hu, and Yu Yang. 2023. Online learning for non-monotone DR-submodular maximization: From full information to bandit feedback. In International Conference on Artificial Intelligence and Statistics. PMLR, 3515–3537

  75. [75]

    Qixin Zhang, Zongqi Wan, Yu Yang, Li Shen, and Dacheng Tao. 2025. Near- Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Ap- proximation and Communication Efficiency. arXiv preprint arXiv:2502.05028 (2025)

  76. [76]

    Puning Zhao, Lifeng Lai, Li Shen, Qingming Li, Jiafei Wu, and Zhe Liu. 2024. A Huber Loss Minimization Approach to Mean Estimation under User-level Differential Privacy. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=TutGINeJzZ

  77. [77]

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition . 16816–16825

  78. [78]

    Tong Zhou, Jiahui Zhao, Yukui Luo, Xi Xie, Wujie Wen, Caiwen Ding, and Xiaolin Xu. 2024. AdaPI: Facilitating DNN Model Adaptivity for Efficient Private Inference in Edge Computing. CoRR (2024)

  79. [79]

    Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chen- long Deng, Zhicheng Dou, and Ji-Rong Wen. 2023. Large language models for information retrieval: A survey. arXiv preprint arXiv:2308.07107 (2023)

  80. [80]

    Honglei Zhuang, Zhen Qin, Rolf Jagerman, Kai Hui, Ji Ma, Jing Lu, Jianmo Ni, Xuanhui Wang, and Michael Bendersky. 2023. Rankt5: Fine-tuning t5 for text ranking with ranking losses. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval . 2308–2313

Showing first 80 references.