pith. machine review for the scientific record.

arxiv: 2605.11254 · v1 · submitted 2026-05-11 · 💻 cs.IR

Recognition: no theorem link

MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval

Daniel Hienert, Derek Greene, Dwaipayan Roy, Mehmet Deniz Türkmen, Philipp Mayr, Suchana Datta

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:52 UTC · model grok-4.3

classification 💻 cs.IR
keywords: information retrieval benchmark · multi-category retrieval · LLM-assisted evaluation · social science search · category-aware ranking · test collection · relevance assessment
0 comments

The pith

The MIRA benchmark uses LLMs to build a test collection, grounded in real user queries, for evaluating retrieval across four categories of scholarly items.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern search systems are expected to retrieve seamlessly from diverse data sources and formats in a single interface, yet current IR benchmarks lack test collections that capture this heterogeneity. The paper introduces MIRA to close that gap by building a collection on a real social science platform that spans Publications, Research Data, Variables, and Instruments & Tools. Real user queries form the basis, while a large language model generates topic descriptions, narratives, and relevance judgments to keep construction costs low. This produces a unified framework for testing category-aware ranking. A reader would care because it supplies a ready-made resource for developing and comparing systems that handle mixed scholarly content without forcing separate searches per category.

Core claim

MIRA is a novel benchmark based on a large-scale social science search platform designed for category-aware ranking across heterogeneous categories—Publications, Research Data, Variables, and Instruments & Tools—within a single unified evaluation framework. The collection draws on real user queries, covers items from four distinct categories, and uses a large language model to generate topic descriptions and narratives along with relevance assessments, which substantially reduces the labor and cost of test collection creation.
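
As a rough illustration of what such a collection looks like in practice, the sketch below reads TREC-style relevance judgments extended with a category tag and counts judged-relevant items per category for a topic. The file layout, field order, and category names are assumptions for illustration, not MIRA's released schema.

```python
# Minimal sketch (not the authors' released format): a TREC-style qrels file
# extended with a category column, plus a per-category count of judged-relevant
# items for one topic. Field order and category names are assumptions.
from collections import defaultdict

CATEGORIES = {"publication", "research_data", "variable", "instrument_tool"}

def load_qrels(path):
    """Read lines of: topic_id  category  item_id  relevance_grade."""
    qrels = defaultdict(dict)  # topic_id -> {(category, item_id): grade}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            topic, category, item, grade = line.split()
            assert category in CATEGORIES, f"unexpected category: {category}"
            qrels[topic][(category, item)] = int(grade)
    return qrels

def relevant_per_category(qrels, topic_id, min_grade=1):
    """Count judged-relevant items per category for a single topic."""
    counts = defaultdict(int)
    for (category, _item), grade in qrels[topic_id].items():
        if grade >= min_grade:
            counts[category] += 1
    return dict(counts)
```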

What carries the argument

LLM-assisted generation of topic descriptions, narratives, and relevance judgments applied to real user queries from a social science platform, producing a multi-category test collection that supports unified evaluation across four heterogeneous item types.
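
A hedged sketch of the kind of LLM judging step this implies: given a topic's query, description, and narrative plus one candidate item, ask a model for a graded relevance label. The prompt wording, grade scale, and the call_llm hook are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative LLM-assisted relevance judging step (assumed prompt and scale,
# not the paper's released pipeline). `call_llm` is any function mapping a
# prompt string to the model's text reply.
import json

def build_judgment_prompt(topic, item):
    return (
        "You are assessing relevance for a scholarly search benchmark.\n"
        f"Query: {topic['query']}\n"
        f"Description: {topic['description']}\n"
        f"Narrative: {topic['narrative']}\n"
        f"Item category: {item['category']}\n"
        f"Item text: {item['text']}\n"
        'Answer as JSON: {"grade": 0|1|2, "reason": "..."} '
        "where 0 = not relevant, 1 = partially relevant, 2 = highly relevant."
    )

def judge(topic, item, call_llm):
    """Return (grade, reason) parsed from the model's JSON reply."""
    reply = call_llm(build_judgment_prompt(topic, item))
    parsed = json.loads(reply)
    grade = int(parsed["grade"])
    if grade not in (0, 1, 2):
        raise ValueError(f"unexpected grade: {grade}")
    return grade, parsed.get("reason", "")
```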

If this is right

  • It supplies a ready resource for testing retrieval systems that must rank items from multiple scholarly categories together.
  • It supports research on category-aware and cross-category ranking algorithms in one evaluation setup.
  • It lowers the cost of creating future multi-category test collections by shifting labor to automated LLM steps.
  • It acts as a foundational testbed for work on multi-faceted, integrated, or heterogeneous information retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same LLM-assisted construction process could be replicated in other domains that mix document types, such as legal or medical search.
  • Performance comparisons on MIRA could highlight whether systems benefit from modeling interactions between the four categories rather than treating them independently.
  • If the approach scales, it may shift benchmark creation away from purely manual annotation toward hybrid human-LLM pipelines across the field.
  • Ongoing use of the underlying platform could allow periodic regeneration of the collection with fresh queries to keep the benchmark current.

Load-bearing premise

Large language model outputs for topic descriptions, narratives, and relevance assessments are accurate and unbiased enough to serve as reliable ground truth without detailed human validation.

What would settle it

A side-by-side human review of a random sample of the LLM-assigned relevance labels: high agreement with expert judgments would support the benchmark's ground truth, while low agreement would show it is not dependable.
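
A minimal sketch of that audit, assuming graded labels keyed by (topic, item): draw a random sample of items judged by both the LLM and experts, then report raw agreement and Cohen's kappa. The sample size and label encoding are placeholders.

```python
# Sketch of the settling experiment: compare LLM labels against expert labels
# on a random co-judged sample. Sample size and label values are assumptions.
import random
from collections import Counter

def cohens_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = Counter(a), Counter(b)
    expected = sum(pa[label] * pb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

def audit_sample(llm_labels, expert_labels, k=200, seed=0):
    """llm_labels / expert_labels: dicts keyed by (topic_id, item_id) -> grade."""
    shared = sorted(set(llm_labels) & set(expert_labels))
    sample = random.Random(seed).sample(shared, min(k, len(shared)))
    a = [llm_labels[key] for key in sample]
    b = [expert_labels[key] for key in sample]
    agreement = sum(x == y for x, y in zip(a, b)) / len(sample)
    return {"n": len(sample), "agreement": agreement, "kappa": cohens_kappa(a, b)}
```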

Figures

Figures reproduced from arXiv: 2605.11254 by Daniel Hienert, Derek Greene, Dwaipayan Roy, Mehmet Deniz Türkmen, Philipp Mayr, Suchana Datta.

Figure 1: Top-50 topic word cloud from topic modeling.
Figure 2: A topic ('immigration') description from the MIRA dataset.
read the original abstract

Users increasingly expect modern search systems to offer a unified interface that seamlessly retrieves information from diverse data sources and formats. However, current information retrieval (IR) evaluation benchmarks have not kept pace with this development, primarily due to the lack of test collections that represent the diversity of contemporary search domains. We address this critical gap with MIRA, a novel benchmark based on a large-scale social science search platform. MIRA is designed for category-aware ranking across heterogeneous categories - Publications, Research Data, Variables, and Instruments & Tools - within a single, unified evaluation framework. The proposed collection is distinctive in several ways: (1) it is built upon real user queries, providing a more realistic basis for evaluation; (2) it covers scholarly items from four distinct categories, enabling multi-faceted evaluation; and (3) it leverages a Large Language Model to generate topic descriptions and narratives, as well as for relevance assessment with respect to these topics, substantially reducing the labor and cost of test collection generation. We release this resource to benefit the community by providing a foundational testbed for the research on multi-faceted, category-aware, integrated, or cross-category information retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents MIRA, a new IR benchmark for multi-category integrated retrieval constructed from real user queries on a large-scale social science search platform. It covers four heterogeneous categories (Publications, Research Data, Variables, and Instruments & Tools) within a unified framework, using an LLM to generate topic descriptions, narratives, and relevance assessments in order to reduce the cost of test collection creation while enabling category-aware ranking evaluation.

Significance. If the LLM-generated labels can be shown to be sufficiently accurate, MIRA would provide a valuable, realistic testbed that addresses the current lack of benchmarks supporting unified evaluation across diverse scholarly data types. This could meaningfully advance research on category-aware and integrated retrieval systems that match modern user expectations for seamless multi-source search.

major comments (1)
  1. [§4 (Benchmark Construction) and §5 (Relevance Assessment)] A high-level description of LLM-assisted topic generation and relevance labeling is provided, but no human validation, inter-annotator agreement statistics, error analysis, or category-specific calibration results are reported. Because the central claim is that MIRA supplies a usable, high-quality ground truth for cross-category ranking, the absence of these checks leaves open the possibility of systematic LLM biases (e.g., favoring textual over structured items) that would invalidate comparative metrics.
minor comments (2)
  1. [Abstract and §7] The release statement in the abstract and conclusion should include a persistent identifier (e.g., DOI or GitHub release tag) and explicit licensing terms for the queries, topics, and judgments.
  2. [Table 1] Table 1 (category statistics) would benefit from an additional column reporting the number of unique queries per category to clarify the balance of the test collection.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. The feedback highlights an important aspect of benchmark quality that we will address in the revision.

read point-by-point responses
  1. Referee: [§4 (Benchmark Construction) and §5 (Relevance Assessment)] A high-level description of LLM-assisted topic generation and relevance labeling is provided, but no human validation, inter-annotator agreement statistics, error analysis, or category-specific calibration results are reported. Because the central claim is that MIRA supplies a usable, high-quality ground truth for cross-category ranking, the absence of these checks leaves open the possibility of systematic LLM biases (e.g., favoring textual over structured items) that would invalidate comparative metrics.

    Authors: We agree that the absence of reported human validation leaves the quality of the LLM-generated labels insufficiently substantiated for a benchmark paper. The current manuscript emphasizes the construction pipeline and cost-reduction benefits but does not include the requested checks. In the revised version we will add a dedicated validation section (new §5.3) that reports: (i) human evaluation of a stratified sample of 200 topics and 1,000 relevance judgments by two independent annotators, (ii) inter-annotator agreement statistics (Cohen’s κ and percentage agreement) broken down by category, (iii) a qualitative error analysis of LLM–human disagreements, and (iv) per-category accuracy and bias diagnostics to test for systematic favoritism toward textual versus structured items. We will also release the annotation guidelines and sampled judgments as supplementary material. These additions will directly address the concern that unverified labels could invalidate cross-category comparisons. revision: yes
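
A sketch of the kind of per-category bias diagnostic the simulated rebuttal proposes: compare the fraction of items the LLM marks relevant with the fraction experts mark relevant within each category, so systematic favoritism toward, say, textual Publications over structured Variables shows up as a consistent gap. All names, thresholds, and data structures here are assumptions for illustration.

```python
# Sketch of a per-category bias diagnostic (assumed data layout): relevance-rate
# gap between LLM and expert labels within each category over co-judged items.
from collections import defaultdict

def per_category_relevance_rates(llm_labels, expert_labels, categories, min_grade=1):
    """
    llm_labels / expert_labels: dict (topic_id, item_id) -> grade
    categories: dict (topic_id, item_id) -> category name
    Returns category -> (llm_rate, expert_rate, gap) over co-judged pairs.
    """
    stats = defaultdict(lambda: [0, 0, 0])  # category -> [n, llm_relevant, expert_relevant]
    for key in set(llm_labels) & set(expert_labels):
        cat = categories[key]
        stats[cat][0] += 1
        stats[cat][1] += llm_labels[key] >= min_grade
        stats[cat][2] += expert_labels[key] >= min_grade
    report = {}
    for cat, (n, llm_rel, exp_rel) in stats.items():
        llm_rate, exp_rate = llm_rel / n, exp_rel / n
        report[cat] = (llm_rate, exp_rate, llm_rate - exp_rate)
    return report
```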

Circularity Check

0 steps flagged

No circularity: MIRA is a newly created resource with no derivation chain

full rationale

The paper introduces MIRA as a benchmark built from real user queries on a social science platform, using LLM assistance to generate topic descriptions, narratives, and relevance labels across four categories. No equations, fitted parameters, predictions, or self-referential derivations appear in the abstract or described content. The construction is presented as an independent methodological contribution for multi-category IR evaluation, with no load-bearing steps that reduce to prior outputs by definition or self-citation. Claims rest on the novelty of the collection itself rather than any tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the representativeness of one social science platform and the adequacy of LLM labeling; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption LLM-generated relevance judgments can substitute for human annotations in creating a reliable multi-category test collection
    Invoked when stating that LLM use substantially reduces labor and cost while maintaining quality.

pith-pipeline@v0.9.0 · 5521 in / 1203 out tokens · 43854 ms · 2026-05-13T01:52:12.765120+00:00 · methodology

discussion (0)

