pith. machine review for the scientific record.

arxiv: 2604.22864 · v1 · submitted 2026-04-23 · 💻 cs.IR · cs.CL

Recognition: unknown

A Large-Scale, Cross-Disciplinary Corpus of Systematic Reviews

Arno Simons, Harrisen Scells, Martin Potthast, Pierre Achkar, Tim Gollub

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:53 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords systematic reviews · corpus · cross-disciplinary · information retrieval · meta-science · evidence synthesis · benchmarking · OpenAlex

The pith

A corpus of 301,871 systematic reviews across all scientific fields enables broad benchmarking and meta-analysis of evidence synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to overcome the scale and coverage limits of prior benchmarks for systematic reviewing, most of which stay small or stay inside biomedicine. It assembles Webis-SR4ALL-26 by harvesting reviews from OpenAlex and running them through a pipeline that attaches metadata, reference lists, and extracted method details. When present, search strategies are turned into normalized, runnable forms and inclusion criteria are pulled out as structured text. The resulting layers let researchers test retrieval and screening systems at scale against real review reference sets and compare review practices by field and year. Releasing the corpus, pipeline, and validation code supplies a shared foundation for work on automated evidence synthesis across disciplines.
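
As a concrete anchor, here is a minimal sketch of the harvesting step against the public OpenAlex works API; the filter below (a title search combined with OpenAlex's review work type) is an illustrative assumption, not the paper's documented selection criterion:

```python
# A minimal harvesting sketch against the public OpenAlex works API.
# The filter below (title search plus OpenAlex's "review" work type) is an
# illustrative assumption, not the paper's documented selection criterion.
import requests

BASE = "https://api.openalex.org/works"

def harvest_reviews(max_pages: int = 3, mailto: str = "you@example.org"):
    """Page through OpenAlex works whose metadata suggests a systematic review."""
    params = {
        "filter": 'title.search:"systematic review",type:review',
        "per-page": 200,   # OpenAlex's maximum page size
        "cursor": "*",     # cursor pagination for deep result sets
        "mailto": mailto,  # identifies the client for the polite pool
    }
    for _ in range(max_pages):
        resp = requests.get(BASE, params=params, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        for work in data["results"]:
            # Each work already carries resolved metadata and a
            # referenced_works list, the two layers the corpus attaches.
            yield {
                "id": work["id"],
                "title": work.get("display_name"),
                "year": work.get("publication_year"),
                "references": work.get("referenced_works", []),
            }
        cursor = data["meta"].get("next_cursor")
        if not cursor:
            break
        params["cursor"] = cursor
```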

Core claim

We present Webis-SR4ALL-26, a large-scale, cross-disciplinary corpus of 301,871 systematic reviews spanning all scientific fields as covered by OpenAlex. Using a multi-stage pre-processing pipeline, we link reviews to resolved OpenAlex metadata and reference lists and extract, when explicitly reported, structured method artifacts relevant to retrieval and screening. These artifacts include reported search strategies (Boolean queries or keyword lists) that we normalize into executable approximations, as well as reported inclusion and exclusion criteria. Together, these layers support cross-domain benchmarking of retrieval and screening components against review reference lists, training and evaluation of extraction methods for review artifacts, and comparative meta-science analyses of systematic review practices across disciplines and time.
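
To make "normalize into executable approximations" concrete, here is a hedged sketch of one plausible normalization pass; the field-tag patterns and rewrite rules are invented for illustration and do not reproduce the paper's actual grammar:

```python
# A hedged sketch of one plausible normalization pass. The field-tag
# patterns and rewrite rules are invented for illustration; the paper's
# actual normalization grammar is not reproduced here.
import re

# Database-specific field tags, e.g. PubMed's [tiab] or Web of Science's TS=
FIELD_TAGS = re.compile(r'\[(?:tiab|mesh|ti|ab)\]|\b(?:TS|TI|AB)\s*=', re.IGNORECASE)

def normalize_strategy(raw: str) -> str:
    """Strip field tags, canonicalize Boolean operators, collapse whitespace."""
    q = FIELD_TAGS.sub("", raw)
    q = re.sub(r'\b(and|or|not)\b', lambda m: m.group(1).upper(), q,
               flags=re.IGNORECASE)
    return re.sub(r'\s+', " ", q).strip()

raw = 'TS=("systematic review" OR meta-analysis)\nAND (education[tiab] or learning[tiab])'
print(normalize_strategy(raw))
# -> ("systematic review" OR meta-analysis) AND (education OR learning)
```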

What carries the argument

Webis-SR4ALL-26 corpus produced by a multi-stage pre-processing pipeline that links reviews to OpenAlex metadata and extracts normalized search strategies plus inclusion criteria

If this is right

  • Cross-domain benchmarking of retrieval and screening components against resolved reference lists becomes possible at large scale.
  • Methods for extracting review artifacts such as search queries and criteria can be trained and evaluated on data from many fields.
  • Comparative meta-science analyses of systematic review practices can be conducted across disciplines and across time.
  • Baseline retrieval performance can be measured by executing the normalized search strategies inside OpenAlex and comparing results to reference lists, as sketched below.
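
A sketch of how that baseline could be run, assuming OpenAlex's `search` parameter as the execution target for a normalized query; whether plain keyword search approximates the original Boolean semantics is exactly what such a baseline probes:

```python
# A sketch of the baseline retrieval demonstration, assuming the OpenAlex
# `search` parameter as the execution target for a normalized strategy.
# The query text and reference IDs are placeholders.
import requests

def retrieve_ids(query: str, pages: int = 5) -> set:
    """Collect OpenAlex work IDs matched by a normalized query string."""
    ids, cursor = set(), "*"
    for _ in range(pages):
        resp = requests.get(
            "https://api.openalex.org/works",
            params={"search": query, "per-page": 200, "cursor": cursor},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        ids.update(work["id"] for work in data["results"])
        cursor = data["meta"].get("next_cursor")
        if not cursor:
            break
    return ids

def recall(retrieved: set, reference_list: set) -> float:
    """Fraction of the review's resolved references the query recovers."""
    return len(retrieved & reference_list) / len(reference_list) if reference_list else 0.0
```

Recall against the resolved reference list is the natural headline number here, since a systematic review's search is judged on comprehensiveness rather than precision.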

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The normalized search strategies could be reused as starting points for improving domain-specific academic search systems.
  • Longitudinal updates to the corpus would allow tracking of changes in how reviews are conducted as publishing practices evolve.
  • Differences in the frequency and structure of reported inclusion criteria between hard sciences and social sciences could be measured directly.

Load-bearing premise

The multi-stage pre-processing pipeline correctly links reviews to OpenAlex metadata and accurately extracts and normalizes search strategies and inclusion criteria with acceptable error rates across all disciplines.

What would settle it

Manual verification on a random sample of non-biomedical reviews showing that more than 20 percent of extracted search strategies or inclusion criteria do not match the original paper text would falsify the corpus quality claim.
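
A sketch of that falsification test, assuming a simple random sample and binary match/mismatch judgments; the IDs, sample size, and simulated labels are placeholders for real human annotation:

```python
# A sketch of the audit: sample reviews at random, collect binary
# match/mismatch judgments from human annotators, and test the 20% bar.
# The IDs, sample size, and simulated judgments are all placeholders.
import random

def mismatch_rate(audits: list) -> float:
    """audits[i] is True when an extracted artifact does NOT match the paper."""
    return sum(audits) / len(audits)

random.seed(0)
corpus_ids = [f"W{n}" for n in range(100_000)]     # stand-in for review IDs
sample = random.sample(corpus_ids, 200)            # reviews sent to annotators
audits = [random.random() < 0.10 for _ in sample]  # placeholder for human labels
verdict = "falsified" if mismatch_rate(audits) > 0.20 else "survives this sample"
print(f"mismatch rate {mismatch_rate(audits):.1%}: corpus quality claim {verdict}")
```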

Figures

Figures reproduced from arXiv: 2604.22864 by Arno Simons, Harrisen Scells, Martin Potthast, Pierre Achkar, Tim Gollub.

Figure 1: Grounded extraction of review key information. Verbatim source spans from the PDF are transformed into structured … [image: figures/full_fig_p005_1.png]
original abstract

Existing benchmarks for systematic reviewing remain limited either in scale or in disciplinary coverage, with some collections comprising only a modest number of topics and others focusing primarily on biomedical research. We present Webis-SR4ALL-26, a large-scale, cross-disciplinary corpus of 301,871 systematic reviews spanning all scientific fields as covered by OpenAlex. Using a multi-stage pre-processing pipeline, we link reviews to resolved OpenAlex metadata and reference lists and extract, when explicitly reported, structured method artifacts relevant to retrieval and screening. These artifacts include reported search strategies (Boolean queries or keyword lists) that we normalize into executable approximations, as well as reported inclusion and exclusion criteria. Together, these layers support cross-domain benchmarking of retrieval and screening components against review reference lists, training and evaluation of extraction methods for review artifacts, and comparative meta-science analyses of systematic review practices across disciplines and time. To demonstrate one concrete use case, we report large-scale baseline retrieval signals by executing normalized search strategies in OpenAlex and comparing retrieved sets to resolved reference lists. We release the corpus and the pre-processing pipeline, along with code used for extraction validation and the retrieval demonstration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents Webis-SR4ALL-26, a corpus of 301,871 systematic reviews spanning all fields in OpenAlex. A multi-stage pipeline links records to resolved metadata and reference lists, extracts reported search strategies (normalized to executable approximations) and inclusion/exclusion criteria, and supports cross-domain benchmarking; the authors demonstrate one use case by executing the normalized queries in OpenAlex and comparing results to reference lists. The corpus, pipeline code, and validation code are released publicly.

Significance. If the pipeline's identification and extraction steps prove reliable, the resource would meaningfully advance IR and meta-science by supplying the first large-scale, cross-disciplinary testbed for retrieval and screening evaluation, moving beyond small or biomedicine-only collections. Releasing the full pipeline and validation code is a clear strength that enables community inspection and extension.

major comments (1)
  1. [Section 3] Section 3 (Corpus Construction) and the validation subsection: the central claim that the corpus supports reliable cross-disciplinary benchmarking rests on the multi-stage pipeline correctly filtering systematic reviews and extracting/normalizing artifacts with acceptable error rates, yet no per-discipline precision/recall figures, confusion-matrix breakdowns, or error analysis for non-biomedical fields are reported. This is load-bearing because reporting conventions differ substantially outside biomedicine, directly affecting the trustworthiness of the reference lists and baseline signals.
minor comments (1)
  1. [Section 5] The demonstration of baseline retrieval signals would be clearer if accompanied by a table summarizing aggregate metrics (e.g., recall@K across disciplines) rather than only high-level statements.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of the corpus and its potential contributions to IR and meta-science. We address the single major comment below.

point-by-point responses
  1. Referee: [Section 3] Section 3 (Corpus Construction) and the validation subsection: the central claim that the corpus supports reliable cross-disciplinary benchmarking rests on the multi-stage pipeline correctly filtering systematic reviews and extracting/normalizing artifacts with acceptable error rates, yet no per-discipline precision/recall figures, confusion-matrix breakdowns, or error analysis for non-biomedical fields are reported. This is load-bearing because reporting conventions differ substantially outside biomedicine, directly affecting the trustworthiness of the reference lists and baseline signals.

    Authors: We agree that the lack of per-discipline validation metrics is a substantive limitation for claims about reliable cross-disciplinary use, given known differences in reporting conventions. The manuscript reports only aggregate precision/recall for the filtering and extraction stages, based on a validation sample that was intended to span disciplines but without field-level breakdowns. In the revised version we will add to Section 3 a table of per-discipline metrics (precision, recall, and confusion-matrix summaries) for the major OpenAlex fields represented in the validation sample, together with a short discussion of sample-size limitations and the impact of varying conventions on reference-list quality. The released validation code will be updated with scripts to reproduce these stratified results.

    revision: yes
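
A sketch of the stratified validation the rebuttal promises, assuming binary review/not-review gold labels bucketed by OpenAlex field; the example records are invented:

```python
# A sketch of the stratified validation the revision promises: per-field
# precision and recall for the review-filtering stage. Field names and the
# (field, gold, predicted) records are invented for illustration.
from collections import defaultdict

def stratified_prf(records):
    """records: iterable of (field, gold_is_review, predicted_is_review)."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for field, gold, pred in records:
        c = counts[field]
        if pred and gold:
            c["tp"] += 1
        elif pred and not gold:
            c["fp"] += 1
        elif gold and not pred:
            c["fn"] += 1
    metrics = {}
    for field, c in counts.items():
        precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        metrics[field] = (precision, recall)
    return metrics

example = [("sociology", True, True), ("sociology", False, True),
           ("physics", True, False), ("physics", True, True)]
for field, (p, r) in stratified_prf(example).items():
    print(f"{field}: precision={p:.2f} recall={r:.2f}")
```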

Circularity Check

0 steps flagged

No circularity: standard data-resource paper with no derivations or fitted predictions

full rationale

The paper presents a corpus (Webis-SR4ALL-26) built from OpenAlex via a multi-stage pipeline that links records, extracts artifacts, and normalizes queries. No equations, predictions, or first-principles results are claimed; the central contribution is the released dataset and code. Validation is mentioned as released but is not used to derive any quantity that reduces to the pipeline inputs by construction. No self-citation load-bearing, uniqueness theorems, or ansatz smuggling appear. The work is self-contained as a descriptive resource release.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data collection and release paper containing no mathematical models, fitted parameters, or postulated entities. It relies on standard assumptions about data cleaning, entity resolution from OpenAlex, and text extraction.

pith-pipeline@v0.9.0 · 5507 in / 1111 out tokens · 46874 ms · 2026-05-09T20:53:38.399853+00:00 · methodology

