Recognition: unknown
A Large-Scale, Cross-Disciplinary Corpus of Systematic Reviews
Pith reviewed 2026-05-09 20:53 UTC · model grok-4.3
The pith
A corpus of 301,871 systematic reviews across all scientific fields enables broad benchmarking and meta-analysis of evidence synthesis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present Webis-SR4ALL-26, a large-scale, cross-disciplinary corpus of 301,871 systematic reviews spanning all scientific fields as covered by OpenAlex. Using a multi-stage pre-processing pipeline, we link reviews to resolved OpenAlex metadata and reference lists and extract, when explicitly reported, structured method artifacts relevant to retrieval and screening. These artifacts include reported search strategies (Boolean queries or keyword lists) that we normalize into executable approximations, as well as reported inclusion and exclusion criteria. Together, these layers support cross-domain benchmarking of retrieval and screening components against review reference lists, training and evaluation of extraction methods for review artifacts, and comparative meta-science analyses of systematic review practices across disciplines and time.
What carries the argument
Webis-SR4ALL-26 corpus produced by a multi-stage pre-processing pipeline that links reviews to OpenAlex metadata and extracts normalized search strategies plus inclusion criteria
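The paper's actual normalization grammar is not reproduced on this page. As a rough illustration of what "normalizing a reported search strategy into an executable approximation" can involve, here is a minimal Boolean-query parser and matcher; the AND/OR/phrase grammar and the substring matching over title text are assumptions for this sketch, not the authors' pipeline:

```python
import re

# Tokenize a reported Boolean search string: quoted phrases, parentheses,
# operators, and bare keywords. Grammar is an illustrative assumption.
TOKEN_RE = re.compile(r'"[^"]+"|\(|\)|AND|OR|[^\s()]+')

def tokenize(query):
    return TOKEN_RE.findall(query)

def parse(tokens):
    """Recursive-descent parse into a nested (op, payload) tree,
    with AND binding tighter than OR."""
    def parse_or(pos):
        node, pos = parse_and(pos)
        terms = [node]
        while pos < len(tokens) and tokens[pos] == "OR":
            nxt, pos = parse_and(pos + 1)
            terms.append(nxt)
        return (("OR", terms) if len(terms) > 1 else node), pos

    def parse_and(pos):
        node, pos = parse_atom(pos)
        terms = [node]
        while pos < len(tokens) and tokens[pos] == "AND":
            nxt, pos = parse_atom(pos + 1)
            terms.append(nxt)
        return (("AND", terms) if len(terms) > 1 else node), pos

    def parse_atom(pos):
        tok = tokens[pos]
        if tok == "(":
            node, pos = parse_or(pos + 1)
            return node, pos + 1  # skip the closing ")"
        return ("TERM", tok.strip('"').lower()), pos + 1

    tree, _ = parse_or(0)
    return tree

def matches(tree, text):
    """Evaluate the parsed query as a predicate over a document's text."""
    op, payload = tree
    if op == "TERM":
        return payload in text.lower()
    hits = (matches(child, text) for child in payload)
    return all(hits) if op == "AND" else any(hits)

q = parse(tokenize('("systematic review" OR "meta-analysis") AND screening'))
print(matches(q, "Automated screening for systematic review updates"))  # True
```

Executing such an approximation against a real index (e.g., the OpenAlex search API) would replace the substring predicate with API queries; the tree representation stays the same.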
If this is right
- Cross-domain benchmarking of retrieval and screening components against resolved reference lists becomes possible at large scale.
- Methods for extracting review artifacts such as search queries and criteria can be trained and evaluated on data from many fields.
- Comparative meta-science analyses of systematic review practices can be conducted across disciplines and across time.
- Baseline retrieval performance can be measured by executing the normalized search strategies inside OpenAlex and comparing results to reference lists.
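Under the setup described above, the baseline retrieval signal reduces to set-based recall of a review's resolved reference list against the set retrieved by the executed search strategy. A minimal sketch, with hypothetical identifiers standing in for OpenAlex work IDs:

```python
def reference_recall(retrieved_ids, reference_ids):
    """Fraction of a review's resolved references found by the executed
    search strategy (set-based recall over work identifiers)."""
    reference_set = set(reference_ids)
    if not reference_set:
        return None  # recall is undefined for an empty reference list
    hits = len(set(retrieved_ids) & reference_set)
    return hits / len(reference_set)

# Hypothetical IDs for illustration only.
retrieved = ["W1", "W2", "W3", "W9"]
references = ["W2", "W3", "W4", "W5"]
print(reference_recall(retrieved, references))  # 0.5
```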
Where Pith is reading between the lines
- The normalized search strategies could be reused as starting points for improving domain-specific academic search systems.
- Longitudinal updates to the corpus would allow tracking of changes in how reviews are conducted as publishing practices evolve.
- Differences in the frequency and structure of reported inclusion criteria between hard sciences and social sciences could be measured directly.
Load-bearing premise
The multi-stage pre-processing pipeline correctly links reviews to OpenAlex metadata and accurately extracts and normalizes search strategies and inclusion criteria with acceptable error rates across all disciplines.
What would settle it
A manual audit of a random sample of non-biomedical reviews would settle it: if more than 20 percent of the extracted search strategies or inclusion criteria fail to match the original paper text, the corpus quality claim is falsified.
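The 20 percent threshold above implies a simple audit design: sample reviews, count extraction mismatches, and ask whether the observed error rate is statistically compatible with an acceptable one. A sketch using an exact binomial tail; the sample size and mismatch count are invented for illustration:

```python
from math import comb

def binom_tail(n, k, p):
    """Exact upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical audit: 100 sampled non-biomedical reviews, 27 with
# extraction mismatches. Under H0 (true error rate at most 20%),
# how surprising is observing at least 27 mismatches?
p_value = binom_tail(100, 27, 0.20)
print(round(p_value, 3))
```

A small p-value here would support the falsification condition; a large one would mean the sample is too small or the mismatch count too low to conclude the error rate exceeds 20 percent.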
Original abstract
Existing benchmarks for systematic reviewing remain limited either in scale or in disciplinary coverage, with some collections comprising only a modest number of topics and others focusing primarily on biomedical research. We present Webis-SR4ALL-26, a large-scale, cross-disciplinary corpus of 301,871 systematic reviews spanning all scientific fields as covered by OpenAlex. Using a multi-stage pre-processing pipeline, we link reviews to resolved OpenAlex metadata and reference lists and extract, when explicitly reported, structured method artifacts relevant to retrieval and screening. These artifacts include reported search strategies (Boolean queries or keyword lists) that we normalize into executable approximations, as well as reported inclusion and exclusion criteria. Together, these layers support cross-domain benchmarking of retrieval and screening components against review reference lists, training and evaluation of extraction methods for review artifacts, and comparative meta-science analyses of systematic review practices across disciplines and time. To demonstrate one concrete use case, we report large-scale baseline retrieval signals by executing normalized search strategies in OpenAlex and comparing retrieved sets to resolved reference lists. We release the corpus and the pre-processing pipeline, along with code used for extraction validation and the retrieval demonstration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Webis-SR4ALL-26, a corpus of 301,871 systematic reviews spanning all fields in OpenAlex. A multi-stage pipeline links records to resolved metadata and reference lists, extracts reported search strategies (normalized to executable approximations) and inclusion/exclusion criteria, and supports cross-domain benchmarking; the authors demonstrate one use case by executing the normalized queries in OpenAlex and comparing results to reference lists. The corpus, pipeline code, and validation code are released publicly.
Significance. If the pipeline's identification and extraction steps prove reliable, the resource would meaningfully advance IR and meta-science by supplying the first large-scale, cross-disciplinary testbed for retrieval and screening evaluation, moving beyond small or biomedicine-only collections. Releasing the full pipeline and validation code is a clear strength that enables community inspection and extension.
major comments (1)
- [Section 3] Section 3 (Corpus Construction) and the validation subsection: the central claim that the corpus supports reliable cross-disciplinary benchmarking rests on the multi-stage pipeline correctly filtering systematic reviews and extracting/normalizing artifacts with acceptable error rates, yet no per-discipline precision/recall figures, confusion-matrix breakdowns, or error analysis for non-biomedical fields are reported. This is load-bearing because reporting conventions differ substantially outside biomedicine, directly affecting the trustworthiness of the reference lists and baseline signals.
minor comments (1)
- [Section 5] The demonstration of baseline retrieval signals would be clearer if accompanied by a table summarizing aggregate metrics (e.g., recall@K across disciplines) rather than only high-level statements.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the corpus and its potential contributions to IR and meta-science. We address the single major comment below.
Point-by-point responses
Referee: [Section 3] Section 3 (Corpus Construction) and the validation subsection: the central claim that the corpus supports reliable cross-disciplinary benchmarking rests on the multi-stage pipeline correctly filtering systematic reviews and extracting/normalizing artifacts with acceptable error rates, yet no per-discipline precision/recall figures, confusion-matrix breakdowns, or error analysis for non-biomedical fields are reported. This is load-bearing because reporting conventions differ substantially outside biomedicine, directly affecting the trustworthiness of the reference lists and baseline signals.
Authors: We agree that the lack of per-discipline validation metrics is a substantive limitation for claims about reliable cross-disciplinary use, given known differences in reporting conventions. The manuscript reports only aggregate precision/recall for the filtering and extraction stages, based on a validation sample that was intended to span disciplines but without field-level breakdowns. In the revised version we will add to Section 3 a table of per-discipline metrics (precision, recall, and confusion-matrix summaries) for the major OpenAlex fields represented in the validation sample, together with a short discussion of sample-size limitations and the impact of varying conventions on reference-list quality. The released validation code will be updated with scripts to reproduce these stratified results. revision: yes
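The stratified validation promised in the rebuttal amounts to per-field precision and recall over labeled pipeline outputs. A minimal sketch of that computation; the field names and labels below are hypothetical, not drawn from the paper's validation sample:

```python
from collections import defaultdict

def stratified_prf(labels):
    """Per-field precision and recall from (field, predicted, actual)
    triples, where predicted/actual are booleans (e.g., 'pipeline kept
    this record' vs. 'record is truly a systematic review')."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for field, predicted, actual in labels:
        c = counts[field]
        if predicted and actual:
            c["tp"] += 1
        elif predicted and not actual:
            c["fp"] += 1
        elif actual:
            c["fn"] += 1
    out = {}
    for field, c in counts.items():
        retrieved = c["tp"] + c["fp"]
        relevant = c["tp"] + c["fn"]
        precision = c["tp"] / retrieved if retrieved else None
        recall = c["tp"] / relevant if relevant else None
        out[field] = (precision, recall)
    return out

sample = [
    ("medicine", True, True), ("medicine", True, False),
    ("sociology", True, True), ("sociology", False, True),
]
print(stratified_prf(sample))
# medicine: precision 0.5, recall 1.0; sociology: precision 1.0, recall 0.5
```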
Circularity Check
No circularity: standard data-resource paper with no derivations or fitted predictions
Full rationale
The paper presents a corpus (Webis-SR4ALL-26) built from OpenAlex via a multi-stage pipeline that links records, extracts artifacts, and normalizes queries. No equations, predictions, or first-principles results are claimed; the central contribution is the released dataset and code. Validation is mentioned as released but is not used to derive any quantity that reduces to the pipeline inputs by construction. No self-citation load-bearing, uniqueness theorems, or ansatz smuggling appear. The work is self-contained as a descriptive resource release.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] David Gough, Sandy Oliver, and James Thomas (Eds.). 2012. An Introduction to Systematic Reviews. Sage Publications Ltd, London; Thousand Oaks, Calif.
- [2] Amal Alharbi and Mark Stevenson. 2019. A Dataset of Systematic Review Updates. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 21-25, 2019, Benjamin Piwowarski, Max Chevalier, Éric Gaussier, Yoelle Maarek, Jian-Yun Nie, and Falk Scholer (Eds.). ACM, 1257–126...
- [3] Joe Barrow, Raj Patel, Misha Kharkovski, Ben Davies, and Ryan Schmitt. 2025. SafePassage: High-Fidelity Information Extraction with Black Box LLMs. CoRR abs/2510.00276 (2025). arXiv:2510.00276 doi:10.48550/ARXIV.2510.00276
- [4] Bettany-Saltikov. 2012. How to Do a Systematic Literature Review in Nursing: A Step-by-Step Guide. Open University Press.
- [5] Clemens Blümel and Alexander Schniedermann. 2020. Studying review articles in scientometrics and beyond: a research agenda. 124, 1 (2020), 711–728. doi:10.1007/s11192-020-03431-7
- [6] Iain Chalmers. 1993. The Cochrane Collaboration: preparing, maintaining, and disseminating systematic reviews of the effects of health care. 703, 1 (1993), 156–165.
- [7] Jacqueline Chandler, Miranda Cumpston, Tianjing Li, Matthew J. Page, and VJHW Welch. 2019. Cochrane Handbook for Systematic Reviews of Interventions. 4, 1002 (2019), 14651858. https://dariososafoula.wordpress.com/wp-content/uploads/2017/01/cochrane-handbook-for-systematic-reviews-of-interventions-2019-1.pdf
- [8] Aaron M. Cohen, William R. Hersh, K. Peterson, and Po-Yin Yen. 2006. Reducing Workload in Systematic Review Preparation Using Automated Citation Classification. J. Am. Medical Informatics Assoc. 13, 2 (2006), 206–219. doi:10.1197/JAMIA.M1929
- [9] Jack H. Culbert, Anne Hobert, Najko Jahn, Nick Haupka, Marion Schmidt, Paul Donner, and Philipp Mayr. 2025. Reference coverage analysis of OpenAlex compared to Web of Science and Scopus. Scientometrics 130 (2025), 2475–2492. doi:10.1007/s11192-025-05293-3
- [10] Gordon Guyatt, John Cairns, David Churchill, Deborah Cook, Brian Haynes, Jack Hirsh, Jan Irvine, Mark Levine, Mitchell Levine, and Jim Nishikawa. 1992. Evidence-based medicine: a new approach to teaching the practice of medicine. 268, 17 (1992), 2420–2425.
- [11] Abdelhakim Hannousse and Salima Yahiouche. 2021. A Semi-automatic Document Screening System for Computer Science Systematic Reviews. In Pattern Recognition and Artificial Intelligence - 5th Mediterranean Conference, MedPRAI 2021, Istanbul, Turkey, December 17-18, 2021, Proceedings (Communications in Computer and Information Science, Vol. 1543), Chawki Dj...
- [12] Brian E. Howard, Jason Phillips, Kyle Miller, Arpit Tandon, Deepak Mav, Mihir R. Shah, Stephanie Holmgren, Katherine E. Pelch, Vickie Walker, Andrew A. Rooney, Malcolm Macleod, Ruchir R. Shah, and Kristina Thayer. 2016. SWIFT-Review: a text-mining workbench for systematic review. Systematic Reviews 5 (2016), 87. doi:10.1186/s13643-016-0263-z
- [13] Evangelos Kanoulas, Dan Li, Leif Azzopardi, and René Spijker. 2017. CLEF 2017 Technologically Assisted Reviews in Empirical Medicine Overview. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017 (CEUR Workshop Proceedings, Vol. 1866), Linda Cappellato, Nicola Ferro, Lorraine Goeuriot, and Thom...
- [14] Evangelos Kanoulas, Dan Li, Leif Azzopardi, and René Spijker. 2018. CLEF 2018 Technologically Assisted Reviews in Empirical Medicine Overview. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018 (CEUR Workshop Proceedings, Vol. 2125), Linda Cappellato, Nicola Ferro, Jian-Yun Nie, and Laure So...
- [15] Evangelos Kanoulas, Dan Li, Leif Azzopardi, and René Spijker. 2019. CLEF 2019 Technology Assisted Reviews in Empirical Medicine Overview. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019 (CEUR Workshop Proceedings, Vol. 2380), Linda Cappellato, Nicola Ferro, David E. Losada, and Henning ...
- [16] Barbara Kitchenham. 2004. Procedures for performing systematic reviews. Keele, UK, Keele University 33, 2004 (2004), 1–26.
- [17] Wojciech Kusa, Óscar E. Mendoza, Matthias Samwald, Petr Knoth, and Allan Hanbury. 2023. CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 -...
- [18] Guillaume Lame. 2019. Systematic Literature Reviews: An Introduction. Proceedings of the Design Society: International Conference on Engineering Design 1, 1 (2019), 1633–1642. doi:10.1017/dsi.2019.169
- [19] Zhijing Li, Yunwen Yu, Wenhao Gu, Tiantian Zhu, Haohua Song, Wenbin Guo, Xiao Yang, and Zexuan Zhu. 2025. Dual-LLM Adversarial Framework for Information Extraction from Research Literature. bioRxiv (2025). doi:10.1101/2025.09.11.675507
- [20] A. Liberati, D. Altman, J. Tetzlaff, C. Mulrow, P. Gøtzsche, J. Ioannidis, Mike Clarke, P. Devereaux, J. Kleijnen, and D. Moher. 2009. The PRISMA Statement for Reporting Systematic Reviews and Meta-Analyses of Studies That Evaluate Health Care Interventions: Explanation and Elaboration. PLoS Med. (2009).
- [21] Peter McMahan and Daniel A. McFarland. 2021. Creative Destruction: The Structural Consequences of Scientific Curation. 86, 2 (2021), 341–376. doi:10.1177/0003122421996323
- [22] David Moher, Jennifer Tetzlaff, Andrea C. Tricco, Margaret Sampson, and Douglas G. Altman. 2007. Epidemiology and reporting characteristics of systematic reviews. 4, 3 (2007), e78. doi:10.1371/journal.pmed.0040078
- [23] C. D. Mulrow. 1994. Systematic Reviews: Rationale for systematic reviews. 309, 6954 (1994), 597–599. doi:10.1136/bmj.309.6954.597
- [24] Mark Newman and David Gough. 2020. Systematic reviews in educational research: Methodology, perspectives and application. 64, 3 (2020), 3–
- [25] https://library.oapen.org/bitstream/handle/20.500.12657/23142/1007012.pdf?sequenc#page=22
- [26] Mark Petticrew and Helen Roberts. 2008. Systematic Reviews in the Social Sciences: A Practical Guide. John Wiley & Sons.
- [27] Jason Priem, Heather A. Piwowar, and Richard Orr. 2022. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. ArXiv abs/2205.01833 (2022). https://api.semanticscholar.org/CorpusID:248512771
- [28]
- [29] Alexander Schniedermann, Clemens Blümel, and Arno Simons. 2022. On Top of the Hierarchy: How Guidelines Shape Systematic Reviewing in Biomedicine. In Evidence in Action between Science and Society: Constructing, Validating, and Contesting Knowledge, Sarah Ehlers and Stefan Esselborn (Eds.). Routledge. doi:10.4324/9781003188612-8
- [30] Arno Simons and Alexander Schniedermann. 2021. The neglected politics behind evidence-based policy: Shedding light on instrument constituency dynamics. 49, 4 (2021), 513–529.
- [31] Liyan Tang, Philippe Laban, and Greg Durrett. 2024. MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguisti...
- [32] Byron C. Wallace, Thomas A. Trikalinos, Joseph Lau, Carla E. Brodley, and Christopher H. Schmid. 2010. Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinform. 11 (2010), 55. doi:10.1186/1471-2105-11-55
- [33] Shuai Wang, Harrisen Scells, Justin Clark, Bevan Koopman, and Guido Zuccon.
- [34]
- [35] Shuai Wang, Harrisen Scells, Bevan Koopman, and Guido Zuccon. 2025. AutoBool: A Reinforcement-Learning-trained LLM for Effective Automated Boolean Query Generation for Systematic Reviews. CoRR abs/2602.00005 (2025). arXiv:2602.00005 doi:10.48550/arXiv.2602.00005