pith. machine review for the scientific record.

arxiv: 2605.11254 · v1 · submitted 2026-05-11 · 💻 cs.IR

Recognition: no theorem link

MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval

Daniel Hienert, Derek Greene, Dwaipayan Roy, Mehmet Deniz Türkmen, Philipp Mayr, Suchana Datta

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:52 UTC · model grok-4.3

classification 💻 cs.IR
keywords: information retrieval benchmark · multi-category retrieval · LLM-assisted evaluation · social science search · category-aware ranking · test collection · relevance assessment
0 comments

The pith

The MIRA benchmark uses LLMs to build a test collection, grounded in real user queries, for evaluating retrieval across four categories of scholarly items.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern search systems are expected to retrieve seamlessly from diverse data sources and formats in a single interface, yet current IR benchmarks lack test collections that capture this heterogeneity. The paper introduces MIRA to close that gap by building a collection on a real social science platform that spans Publications, Research Data, Variables, and Instruments & Tools. Real user queries form the basis, while a large language model generates topic descriptions, narratives, and relevance judgments to keep construction costs low. This produces a unified framework for testing category-aware ranking. A reader would care because it supplies a ready-made resource for developing and comparing systems that handle mixed scholarly content without forcing separate searches per category.

Core claim

MIRA is a novel benchmark based on a large-scale social science search platform designed for category-aware ranking across heterogeneous categories—Publications, Research Data, Variables, and Instruments & Tools—within a single unified evaluation framework. The collection draws on real user queries, covers items from four distinct categories, and uses a large language model to generate topic descriptions and narratives along with relevance assessments, which substantially reduces the labor and cost of test collection creation.
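
As a rough illustration of what such a collection looks like in practice, the sketch below reads TREC-style relevance judgments extended with a category tag and counts judged-relevant items per category for a topic. The file layout, field order, and category names are assumptions for illustration, not MIRA's released schema.

```python
# Minimal sketch (not the authors' released format): a TREC-style qrels file
# extended with a category column, plus a per-category count of judged-relevant
# items for one topic. Field order and category names are assumptions.
from collections import defaultdict

CATEGORIES = {"publication", "research_data", "variable", "instrument_tool"}

def load_qrels(path):
    """Read lines of: topic_id  category  item_id  relevance_grade."""
    qrels = defaultdict(dict)  # topic_id -> {(category, item_id): grade}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            topic, category, item, grade = line.split()
            assert category in CATEGORIES, f"unexpected category: {category}"
            qrels[topic][(category, item)] = int(grade)
    return qrels

def relevant_per_category(qrels, topic_id, min_grade=1):
    """Count judged-relevant items per category for a single topic."""
    counts = defaultdict(int)
    for (category, _item), grade in qrels[topic_id].items():
        if grade >= min_grade:
            counts[category] += 1
    return dict(counts)
```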

What carries the argument

LLM-assisted generation of topic descriptions, narratives, and relevance judgments applied to real user queries from a social science platform, producing a multi-category test collection that supports unified evaluation across four heterogeneous item types.
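
A hedged sketch of the kind of LLM judging step this implies: given a topic's query, description, and narrative plus one candidate item, ask a model for a graded relevance label. The prompt wording, grade scale, and the call_llm hook are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative LLM-assisted relevance judging step (assumed prompt and scale,
# not the paper's released pipeline). `call_llm` is any function mapping a
# prompt string to the model's text reply.
import json

def build_judgment_prompt(topic, item):
    return (
        "You are assessing relevance for a scholarly search benchmark.\n"
        f"Query: {topic['query']}\n"
        f"Description: {topic['description']}\n"
        f"Narrative: {topic['narrative']}\n"
        f"Item category: {item['category']}\n"
        f"Item text: {item['text']}\n"
        'Answer as JSON: {"grade": 0|1|2, "reason": "..."} '
        "where 0 = not relevant, 1 = partially relevant, 2 = highly relevant."
    )

def judge(topic, item, call_llm):
    """Return (grade, reason) parsed from the model's JSON reply."""
    reply = call_llm(build_judgment_prompt(topic, item))
    parsed = json.loads(reply)
    grade = int(parsed["grade"])
    if grade not in (0, 1, 2):
        raise ValueError(f"unexpected grade: {grade}")
    return grade, parsed.get("reason", "")
```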

If this is right

  • It supplies a ready resource for testing retrieval systems that must rank items from multiple scholarly categories together.
  • It supports research on category-aware and cross-category ranking algorithms in one evaluation setup.
  • It lowers the cost of creating future multi-category test collections by shifting labor to automated LLM steps.
  • It acts as a foundational testbed for work on multi-faceted, integrated, or heterogeneous information retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same LLM-assisted construction process could be replicated in other domains that mix document types, such as legal or medical search.
  • Performance comparisons on MIRA could highlight whether systems benefit from modeling interactions between the four categories rather than treating them independently.
  • If the approach scales, it may shift benchmark creation away from purely manual annotation toward hybrid human-LLM pipelines across the field.
  • Ongoing use of the underlying platform could allow periodic regeneration of the collection with fresh queries to keep the benchmark current.

Load-bearing premise

Large language model outputs for topic descriptions, narratives, and relevance assessments are accurate and unbiased enough to serve as reliable ground truth without detailed human validation.

What would settle it

A side-by-side human review of a random sample of the LLM-assigned relevance labels: high agreement with expert judgments would support the benchmark's ground truth, while low agreement would show it is not dependable.
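
A minimal sketch of that audit, assuming graded labels keyed by (topic, item): draw a random sample of items judged by both the LLM and experts, then report raw agreement and Cohen's kappa. The sample size and label encoding are placeholders.

```python
# Sketch of the settling experiment: compare LLM labels against expert labels
# on a random co-judged sample. Sample size and label values are assumptions.
import random
from collections import Counter

def cohens_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = Counter(a), Counter(b)
    expected = sum(pa[label] * pb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

def audit_sample(llm_labels, expert_labels, k=200, seed=0):
    """llm_labels / expert_labels: dicts keyed by (topic_id, item_id) -> grade."""
    shared = sorted(set(llm_labels) & set(expert_labels))
    sample = random.Random(seed).sample(shared, min(k, len(shared)))
    a = [llm_labels[key] for key in sample]
    b = [expert_labels[key] for key in sample]
    agreement = sum(x == y for x, y in zip(a, b)) / len(sample)
    return {"n": len(sample), "agreement": agreement, "kappa": cohens_kappa(a, b)}
```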

Figures

Figures reproduced from arXiv: 2605.11254 by Daniel Hienert, Derek Greene, Dwaipayan Roy, Mehmet Deniz Türkmen, Philipp Mayr, Suchana Datta.

Figure 1: Top-50 topic word cloud from topic modeling.
Figure 2: A topic ('immigration') description from the MIRA dataset.
read the original abstract

Users increasingly expect modern search systems to offer a unified interface that seamlessly retrieves information from diverse data sources and formats. However, current information retrieval (IR) evaluation benchmarks have not kept pace with this development, primarily due to the lack of test collections that represent the diversity of contemporary search domains. We address this critical gap with MIRA, a novel benchmark based on a large-scale social science search platform. MIRA is designed for category-aware ranking across heterogeneous categories - Publications, Research Data, Variables, and Instruments & Tools - within a single, unified evaluation framework. The proposed collection is distinctive in several ways: (1) it is built upon real user queries, providing a more realistic basis for evaluation; (2) it covers scholarly items from four distinct categories, enabling multi-faceted evaluation; and (3) it leverages a Large Language Model to generate topic descriptions and narratives, as well as for relevance assessment with respect to these topics, substantially reducing the labor and cost of test collection generation. We release this resource to benefit the community by providing a foundational testbed for the research on multi-faceted, category-aware, integrated, or cross-category information retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents MIRA, a new IR benchmark for multi-category integrated retrieval constructed from real user queries on a large-scale social science search platform. It covers four heterogeneous categories (Publications, Research Data, Variables, and Instruments & Tools) within a unified framework, using an LLM to generate topic descriptions, narratives, and relevance assessments in order to reduce the cost of test collection creation while enabling category-aware ranking evaluation.

Significance. If the LLM-generated labels can be shown to be sufficiently accurate, MIRA would provide a valuable, realistic testbed that addresses the current lack of benchmarks supporting unified evaluation across diverse scholarly data types. This could meaningfully advance research on category-aware and integrated retrieval systems that match modern user expectations for seamless multi-source search.

major comments (1)
  1. [§4 (Benchmark Construction) and §5 (Relevance Assessment)] A high-level description of LLM-assisted topic generation and relevance labeling is provided, but no human validation, inter-annotator agreement statistics, error analysis, or category-specific calibration results are reported. Because the central claim is that MIRA supplies a usable, high-quality ground truth for cross-category ranking, the absence of these checks leaves open the possibility of systematic LLM biases (e.g., favoring textual over structured items) that would invalidate comparative metrics.
minor comments (2)
  1. [Abstract and §7] The release statement in the abstract and conclusion should include a persistent identifier (e.g., DOI or GitHub release tag) and explicit licensing terms for the queries, topics, and judgments.
  2. [Table 1] Table 1 (category statistics) would benefit from an additional column reporting the number of unique queries per category to clarify the balance of the test collection.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. The feedback highlights an important aspect of benchmark quality that we will address in the revision.

read point-by-point responses
  1. Referee: [§4 (Benchmark Construction) and §5 (Relevance Assessment)] A high-level description of LLM-assisted topic generation and relevance labeling is provided, but no human validation, inter-annotator agreement statistics, error analysis, or category-specific calibration results are reported. Because the central claim is that MIRA supplies a usable, high-quality ground truth for cross-category ranking, the absence of these checks leaves open the possibility of systematic LLM biases (e.g., favoring textual over structured items) that would invalidate comparative metrics.

    Authors: We agree that the absence of reported human validation leaves the quality of the LLM-generated labels insufficiently substantiated for a benchmark paper. The current manuscript emphasizes the construction pipeline and cost-reduction benefits but does not include the requested checks. In the revised version we will add a dedicated validation section (new §5.3) that reports: (i) human evaluation of a stratified sample of 200 topics and 1,000 relevance judgments by two independent annotators, (ii) inter-annotator agreement statistics (Cohen’s κ and percentage agreement) broken down by category, (iii) a qualitative error analysis of LLM–human disagreements, and (iv) per-category accuracy and bias diagnostics to test for systematic favoritism toward textual versus structured items. We will also release the annotation guidelines and sampled judgments as supplementary material. These additions will directly address the concern that unverified labels could invalidate cross-category comparisons. revision: yes
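
A sketch of the kind of per-category bias diagnostic the simulated rebuttal proposes: compare the fraction of items the LLM marks relevant with the fraction experts mark relevant within each category, so systematic favoritism toward, say, textual Publications over structured Variables shows up as a consistent gap. All names, thresholds, and data structures here are assumptions for illustration.

```python
# Sketch of a per-category bias diagnostic (assumed data layout): relevance-rate
# gap between LLM and expert labels within each category over co-judged items.
from collections import defaultdict

def per_category_relevance_rates(llm_labels, expert_labels, categories, min_grade=1):
    """
    llm_labels / expert_labels: dict (topic_id, item_id) -> grade
    categories: dict (topic_id, item_id) -> category name
    Returns category -> (llm_rate, expert_rate, gap) over co-judged pairs.
    """
    stats = defaultdict(lambda: [0, 0, 0])  # category -> [n, llm_relevant, expert_relevant]
    for key in set(llm_labels) & set(expert_labels):
        cat = categories[key]
        stats[cat][0] += 1
        stats[cat][1] += llm_labels[key] >= min_grade
        stats[cat][2] += expert_labels[key] >= min_grade
    report = {}
    for cat, (n, llm_rel, exp_rel) in stats.items():
        llm_rate, exp_rate = llm_rel / n, exp_rel / n
        report[cat] = (llm_rate, exp_rate, llm_rate - exp_rate)
    return report
```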

Circularity Check

0 steps flagged

No circularity: MIRA is a newly created resource with no derivation chain

full rationale

The paper introduces MIRA as a benchmark built from real user queries on a social science platform, using LLM assistance to generate topic descriptions, narratives, and relevance labels across four categories. No equations, fitted parameters, predictions, or self-referential derivations appear in the abstract or described content. The construction is presented as an independent methodological contribution for multi-category IR evaluation, with no load-bearing steps that reduce to prior outputs by definition or self-citation. Claims rest on the novelty of the collection itself rather than any tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the representativeness of one social science platform and the adequacy of LLM labeling; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption LLM-generated relevance judgments can substitute for human annotations in creating a reliable multi-category test collection
    Invoked when stating that LLM use substantially reduces labor and cost while maintaining quality.

pith-pipeline@v0.9.0 · 5521 in / 1203 out tokens · 43854 ms · 2026-05-13T01:52:12.765120+00:00 · methodology

discussion (0)

