A Collection of Systematic Reviews in Computer Science

Pierre Achkar; Tim Gollub amd Martin Potthast

arxiv: 2604.16330 · v1 · submitted 2026-03-11 · 💻 cs.IR · cs.DL

A Collection of Systematic Reviews in Computer Science

Pierre Achkar , Tim Gollub amd Martin Potthast This is my paper

Pith reviewed 2026-05-15 13:14 UTC · model grok-4.3

classification 💻 cs.IR cs.DL

keywords systematic reviewsBoolean queriesinformation retrievaldatasetreproducible researchquery generationcomputer science

0 comments

The pith

SR4CS releases 1,212 computer science systematic reviews with their original Boolean queries to enable reproducible experiments on automated retrieval and screening.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a new resource called SR4CS that gathers 1,212 systematic reviews from computer science along with the expert Boolean search queries used to find their references. This collection also supplies the resolved references and structured metadata from those reviews, plus simplified versions of the queries that run only over titles and abstracts. The goal is to give researchers outside the biomedical field a shared testbed for studying how to automate query creation, document retrieval, and screening steps. Baseline tests in the paper compare the original expert queries against LLM-generated queries, BM25 ranking, and dense retrieval to show measurable differences in precision, recall, and ranking behavior. The full collection is made available under an open license with documentation so others can run controlled, repeatable evaluations.

Core claim

SR4CS is a corpus of 1,212 computer science systematic reviews that includes the original expert-designed Boolean search queries, 104,316 resolved references, and methodological metadata; the paper also supplies normalized approximations of those queries that operate solely over titles and abstracts, allowing direct comparison of expert queries with zero-shot LLM-generated Boolean queries, BM25, and dense retrieval under a single evaluation protocol.

What carries the argument

SR4CS corpus of 1,212 reviews paired with original expert Boolean queries and their normalized title-abstract approximations, which together serve as the shared testbed for measuring retrieval and screening performance.

If this is right

Researchers can now run controlled comparisons between human-written Boolean queries and LLM-generated queries on the same set of resolved references.
Standard retrieval methods such as BM25 and dense vectors can be evaluated for their ability to replace or augment Boolean search in systematic review workflows.
The normalized title-abstract query versions make it possible to isolate the contribution of query formulation from full-text access.
Future automation systems can be benchmarked for both recall of relevant papers and reduction in manual screening effort using the provided metadata.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The collection could serve as a seed for building larger cross-domain benchmarks that combine computer science reviews with biomedical ones to test domain adaptation of retrieval models.
Releasing the queries in both original and normalized forms may encourage development of query-rewriting techniques that preserve expert intent while improving machine readability.
The dataset opens the possibility of studying how query complexity correlates with screening workload across different subfields of computer science.

Load-bearing premise

The 1,212 collected reviews are representative of typical computer science systematic reviews and the normalized title-abstract queries still capture the core intent of the original expert Boolean queries.

What would settle it

Running the same baseline retrieval experiments on a fresh independent sample of computer science systematic reviews and obtaining substantially different precision-recall curves or ranking patterns would indicate that the SR4CS collection does not generalize.

Figures

Figures reproduced from arXiv: 2604.16330 by Pierre Achkar, Tim Gollub amd Martin Potthast.

**Figure 1.** Figure 1: SR4CS-25 construction pipeline: collection, filtering, parsing, extraction, and reference resolution. 3.1. Data Collection A set of candidate systematic reviews was retrieved from DBLP by searching titles for “systematic review”, yielding 11,317 results. We applied filter rules to retain only genuine systematic reviews that required clear wording (e.g., “systematic review of/on”), peer review procedures, o… view at source ↗

read the original abstract

Systematic reviews are the standard method for synthesizing scientific evidence, but their creation requires substantial manual effort, particularly during retrieval and screening. While recent work has explored automating these steps, evaluation resources remain largely confined to the biomedical domain, limiting reproducible experimentation in other domains. This paper introduces SR4CS, a large-scale collection of systematic reviews in computer science, designed to support reproducible research on Boolean query generation, retrieval, and screening. The corpus comprises 1,212 systematic reviews with their original expert-designed Boolean search queries, 104,316 resolved references, and structured methodological metadata. For controlled evaluation, the original Boolean queries are additionally provided in a normalized, approximated form operating over titles and abstracts. To illustrate the intended use of the collection, baseline experiments compare the approximated expert Boolean queries with zero-shot LLM-generated Boolean queries, BM25, and dense retrieval under a unified evaluation setting. The results highlight systematic differences in precision, recall, and ranking behavior across retrieval paradigms and expose limitations of naive zero-shot Boolean generation. SR4CS is released under an open license on Zenodo (https://doi.org/10.5281/zenodo.17163932), together with documentation and code (https://github.com/webis-de/scolia26-sr4cs), to enable reproducible evaluation and future research on scaling systematic review automation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SR4CS is a useful new open dataset of 1,212 CS systematic reviews with original Boolean queries and references, filling a clear gap even if collection details are thin.

read the letter

This paper's main contribution is releasing SR4CS: 1,212 computer science systematic reviews, their expert Boolean queries, 104k resolved references, and metadata, plus normalized title-abstract versions for easier testing. It also ships baselines comparing those queries against BM25, dense retrieval, and zero-shot LLM queries. The resource is openly available on Zenodo and GitHub with code, which is the right way to do a dataset paper. That alone makes it worth having for anyone working on retrieval or screening automation outside biomedicine. The baselines are straightforward illustrations rather than deep claims, and they show plausible differences in precision and recall across methods. The collection is new for the CS domain at this scale, and the normalized queries are a practical addition for controlled experiments. The soft spot is the lack of detail on how the reviews were gathered, deduplicated, or checked for quality. Without that, it's hard to judge how representative the set is of typical CS reviews or whether the original queries were preserved accurately in the normalized form. The paper treats the release itself as the result, so the missing methods section is the main limitation rather than a fatal flaw. This is for information retrieval and evidence-synthesis researchers who need test collections beyond PubMed. A serious editor should send it to peer review because the artifacts are real and the gap it fills is documented; reviewers can push for the collection methods and any quality checks in revision. I'd bring it to a reading group focused on IR datasets or automation tools.

Referee Report

1 major / 0 minor

Summary. The paper introduces SR4CS, a dataset comprising 1,212 systematic reviews in computer science, including their original expert Boolean queries, 104,316 resolved references, structured metadata, and normalized title-abstract query approximations, released openly on Zenodo and GitHub. It also reports baseline experiments comparing the approximated expert queries against zero-shot LLM-generated Boolean queries, BM25, and dense retrieval in a unified setting, highlighting differences in precision, recall, and ranking behavior.

Significance. If the collection methodology is sound and the reviews are representative, SR4CS fills a notable gap by providing the first large-scale, openly available resource for systematic-review automation research outside the biomedical domain. The inclusion of both original and normalized queries, plus reproducible baselines and code, directly supports experimentation on Boolean query generation, retrieval, and screening while enabling verification through the released artifacts.

major comments (1)

[Abstract] Abstract: the description of the corpus (size, contents, baselines) is given, but no details appear on collection methodology, deduplication, or quality control; without these the representativeness of the 1,212 reviews and the fidelity of the normalized queries cannot be assessed, which is load-bearing for the claim that SR4CS supports reproducible research.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of SR4CS. The single major comment highlights a valid point about the abstract's conciseness. We have revised the abstract to incorporate a brief description of the collection methodology, deduplication, and quality control, while preserving its length and focus. The full manuscript already details these aspects in Section 3.

read point-by-point responses

Referee: [Abstract] Abstract: the description of the corpus (size, contents, baselines) is given, but no details appear on collection methodology, deduplication, or quality control; without these the representativeness of the 1,212 reviews and the fidelity of the normalized queries cannot be assessed, which is load-bearing for the claim that SR4CS supports reproducible research.

Authors: We agree that the abstract should briefly address collection methodology to support claims of representativeness and reproducibility. The full manuscript (Section 3) describes sourcing reviews from ACM, IEEE, and other databases via targeted searches, followed by deduplication using title+author+year matching and manual verification, plus quality control through expert review of a random sample for query fidelity and reference completeness. To address the comment, we have updated the abstract with the following addition: 'Reviews were collected via systematic searches across major CS databases, deduplicated using metadata matching, and validated for query and reference quality.' This makes the abstract self-contained without altering its structure or length. revision: yes

Circularity Check

0 steps flagged

Dataset release with no derivations or predictions

full rationale

The paper is a dataset release paper whose central claim is the introduction of SR4CS (1,212 reviews, original Boolean queries, 104k references, normalized title/abstract versions) as an open resource. No equations, fitted parameters, or predictions are present that reduce to prior quantities by construction. Baselines are presented only as illustrations of intended use under a unified evaluation setting. All content is externally grounded in the released Zenodo/GitHub artifacts, which directly enable verification of contents and utility. No self-citation chains, ansatzes, or uniqueness theorems are invoked to support the release claim itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution is purely a curated dataset release.

pith-pipeline@v0.9.0 · 5534 in / 1022 out tokens · 39643 ms · 2026-05-15T13:14:51.085039+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1]

Liberati, D

A. Liberati, D. Altman, J. Tetzlaff, C. Mulrow, P. Gøtzsche, J. Ioannidis, M. Clarke, M. Clarke, P. Devereaux, J. Kleijnen, D. Moher, The prisma statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: Explanation and elaboration, PLoS Med. (2009)

work page 2009
[2]

Lamé, Systematic literature reviews: An introduction, Proc

G. Lamé, Systematic literature reviews: An introduction, Proc. of Design Soc.: Int. Conf. on Engineering Design (2019)

work page 2019
[3]

Lefebvre, J

C. Lefebvre, J. Glanville, S. Briscoe, A. Littlewood, C. Marshall, M.-I. Metzendorf, A. Noel- Storr, T. Rader, F. Shokraneh, J. Thomas, L. S. Wieland, Searching for and selecting stud- ies, John Wiley & Sons, Ltd, 2019. doi: https://doi.org/10.1002/9781119536604.ch4. arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781119536604.ch4

work page doi:10.1002/9781119536604.ch4 2019
[4]

MacFarlane, T

A. MacFarlane, T. Russell-Rose, F. Shokraneh, Search strategy formulation for systematic reviews: Issues, chal- lenges and opportunities, Intel. Sys. with Applications (2022). doi: https://doi.org/10.1016/j.iswa.2022. 200091

work page doi:10.1016/j.iswa.2022 2022
[5]

Scells, G

H. Scells, G. Zuccon, B. Koopman, A comparison of automatic boolean query formulation for systematic reviews, Inf. Retr. J. (2021). doi:10.1007/S10791-020-09381-1

work page doi:10.1007/s10791-020-09381-1 2021
[6]

S. Wang, H. Scells, B. Koopman, G. Zuccon, Can chatgpt write a good boolean query for systematic review literature search?, in: Proc. of SIGIR 2023, ACM, 2023. doi:10.1145/3539618.3591703

work page doi:10.1145/3539618.3591703 2023
[7]

S. Wang, H. Scells, S. Zhuang, M. Potthast, B. Koopman, G. Zuccon, Zero-shot generative large language mod- els for systematic review screening automation, in: Proc. of ECIR 2024, LNCS, Springer, 2024. doi: 10.1007/ 978-3-031-56027-9\_25

work page 2024
[8]

M. A. Sami, Z. Rasheed, K. Kemell, M. Waseem, T. Kilamo, M. Saari, A. Nguyen-Duc, K. Systä, P. Abrahamsson, System for systematic literature review using multiple AI agents: Concept and an empirical evaluation, CoRR (2024). doi:10.48550/ARXIV.2403.08399.arXiv:2403.08399

work page doi:10.48550/arxiv.2403.08399.arxiv:2403.08399 2024
[9]

Scells, G

H. Scells, G. Zuccon, B. Koopman, A. Deacon, L. Azzopardi, S. Geva, A test collection for evaluating retrieval of studies for inclusion in systematic reviews, in: Proc. of SIGIR 2017, ACM, 2017. doi: 10.1145/3077136.3080707

work page doi:10.1145/3077136.3080707 2017
[10]

Kanoulas, D

E. Kanoulas, D. Li, L. Azzopardi, R. Spijker, CLEF 2019 technology assisted reviews in empirical medicine overview, in: W.N. of CLEF 2019, CEUR-WS.org, 2019

work page 2019
[11]

S. Wang, H. Scells, J. Clark, B. Koopman, G. Zuccon, From little things big things grow: A collection with seed studies for medical systematic review literature search, in: Proc. of SIGIR 2022, ACM, 2022. doi:10.1145/3477495. 3531748

work page doi:10.1145/3477495 2022
[12]

M. P. Polak, D. Morgan, Extracting accurate materials data from research papers with conversational lan- guage models and prompt engineering - example of chatgpt, CoRR (2023). doi:10.48550/ARXIV.2303.05352. arXiv:2303.05352

work page doi:10.48550/arxiv.2303.05352 2023
[13]

Gartlehner, L

G. Gartlehner, L. Kahwati, R. Hilscher, I. Thomas, S. Kugley, K. Crotty, M. Viswanathan, B. Nussbaumer-Streit, G. Booth, N. Erskine, A. Konet, R. Chew, Data extraction for evidence synthesis using a large language model: A proof-of-concept study, Research Synthesis Methods (2024). doi: https://doi.org/10.1002/jrsm.1710. arXiv:https://onlinelibrary.wiley.c...

work page doi:10.1002/jrsm.1710 2024
[14]

Backes, A

T. Backes, A. Iurshina, M. A. Shahid, P. Mayr, Comparing free reference extraction pipelines, Int. J. Digit. Libr. (2024). doi:10.1007/S00799-024-00404-6

work page doi:10.1007/s00799-024-00404-6 2024
[15]

olmocr: Unlocking trillions of tokens in pdfs with vi- sion language models.arXiv preprint arXiv:2502.18443, 2025a

J. Poznanski, A. Rangapur, J. Borchardt, J. Dunkelberger, R. Huff, D. Lin, A. Rangapur, C. Wilhelm, K. Lo, L. Soldaini, olmocr: Unlocking trillions of tokens in pdfs with vision language models, 2025.arXiv:2502.18443

work page arXiv 2025
[16]

H. Lai, J. Liu, C. Bai, H. Liu, B. Pan, X. Luo, L. Hou, W. Zhao, D. Xia, J. Tian, Y. Chen, L. Zhang, J. Estill, J. Liu, X. Liao, N. Shi, X. Sun, H. Shang, Z. Bian, K. Yang, L. Huang, L. Ge, H. Li, Y. Wang, H. Zhang, D. Zhu, D. Peng, F. Wang, Y. Li, S. Tang, H. Liu, Z. Li, Z. Yang, X. Yu, Y. Qin, Language models for data extraction and risk of bias assessm...

work page doi:10.1038/s41746-025-01457-w 2025
[17]

Z. Li, Y. Yu, W. Gu, T. Zhu, H. Song, W. Guo, X. Yang, Z. Zhu, Dual-llm adversarial framework for information extraction from research literature, bioRxiv (2025). URL: https: //www.biorxiv.org/content/early/2025/09/16/2025.09.11.675507. doi: 10.1101/2025.09.11.675507. arXiv:https://www.biorxiv.org/content/early/2025/09/16/2025.09.11.675507.full.pdf

work page doi:10.1101/2025.09.11.675507 2025
[18]

Barrow, R

J. Barrow, R. Patel, M. Kharkovski, B. Davies, R. Schmitt, Safepassage: High-fidelity information extraction with black box llms, CoRR abs/2510.00276 (2025). URL: https://doi.org/10.48550/arXiv.2510.00276. doi:10.48550/ARXIV. 2510.00276.arXiv:2510.00276

work page doi:10.48550/arxiv.2510.00276 2025

[1] [1]

Liberati, D

A. Liberati, D. Altman, J. Tetzlaff, C. Mulrow, P. Gøtzsche, J. Ioannidis, M. Clarke, M. Clarke, P. Devereaux, J. Kleijnen, D. Moher, The prisma statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: Explanation and elaboration, PLoS Med. (2009)

work page 2009

[2] [2]

Lamé, Systematic literature reviews: An introduction, Proc

G. Lamé, Systematic literature reviews: An introduction, Proc. of Design Soc.: Int. Conf. on Engineering Design (2019)

work page 2019

[3] [3]

Lefebvre, J

C. Lefebvre, J. Glanville, S. Briscoe, A. Littlewood, C. Marshall, M.-I. Metzendorf, A. Noel- Storr, T. Rader, F. Shokraneh, J. Thomas, L. S. Wieland, Searching for and selecting stud- ies, John Wiley & Sons, Ltd, 2019. doi: https://doi.org/10.1002/9781119536604.ch4. arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781119536604.ch4

work page doi:10.1002/9781119536604.ch4 2019

[4] [4]

MacFarlane, T

A. MacFarlane, T. Russell-Rose, F. Shokraneh, Search strategy formulation for systematic reviews: Issues, chal- lenges and opportunities, Intel. Sys. with Applications (2022). doi: https://doi.org/10.1016/j.iswa.2022. 200091

work page doi:10.1016/j.iswa.2022 2022

[5] [5]

Scells, G

H. Scells, G. Zuccon, B. Koopman, A comparison of automatic boolean query formulation for systematic reviews, Inf. Retr. J. (2021). doi:10.1007/S10791-020-09381-1

work page doi:10.1007/s10791-020-09381-1 2021

[6] [6]

S. Wang, H. Scells, B. Koopman, G. Zuccon, Can chatgpt write a good boolean query for systematic review literature search?, in: Proc. of SIGIR 2023, ACM, 2023. doi:10.1145/3539618.3591703

work page doi:10.1145/3539618.3591703 2023

[7] [7]

S. Wang, H. Scells, S. Zhuang, M. Potthast, B. Koopman, G. Zuccon, Zero-shot generative large language mod- els for systematic review screening automation, in: Proc. of ECIR 2024, LNCS, Springer, 2024. doi: 10.1007/ 978-3-031-56027-9\_25

work page 2024

[8] [8]

M. A. Sami, Z. Rasheed, K. Kemell, M. Waseem, T. Kilamo, M. Saari, A. Nguyen-Duc, K. Systä, P. Abrahamsson, System for systematic literature review using multiple AI agents: Concept and an empirical evaluation, CoRR (2024). doi:10.48550/ARXIV.2403.08399.arXiv:2403.08399

work page doi:10.48550/arxiv.2403.08399.arxiv:2403.08399 2024

[9] [9]

Scells, G

H. Scells, G. Zuccon, B. Koopman, A. Deacon, L. Azzopardi, S. Geva, A test collection for evaluating retrieval of studies for inclusion in systematic reviews, in: Proc. of SIGIR 2017, ACM, 2017. doi: 10.1145/3077136.3080707

work page doi:10.1145/3077136.3080707 2017

[10] [10]

Kanoulas, D

E. Kanoulas, D. Li, L. Azzopardi, R. Spijker, CLEF 2019 technology assisted reviews in empirical medicine overview, in: W.N. of CLEF 2019, CEUR-WS.org, 2019

work page 2019

[11] [11]

S. Wang, H. Scells, J. Clark, B. Koopman, G. Zuccon, From little things big things grow: A collection with seed studies for medical systematic review literature search, in: Proc. of SIGIR 2022, ACM, 2022. doi:10.1145/3477495. 3531748

work page doi:10.1145/3477495 2022

[12] [12]

M. P. Polak, D. Morgan, Extracting accurate materials data from research papers with conversational lan- guage models and prompt engineering - example of chatgpt, CoRR (2023). doi:10.48550/ARXIV.2303.05352. arXiv:2303.05352

work page doi:10.48550/arxiv.2303.05352 2023

[13] [13]

Gartlehner, L

G. Gartlehner, L. Kahwati, R. Hilscher, I. Thomas, S. Kugley, K. Crotty, M. Viswanathan, B. Nussbaumer-Streit, G. Booth, N. Erskine, A. Konet, R. Chew, Data extraction for evidence synthesis using a large language model: A proof-of-concept study, Research Synthesis Methods (2024). doi: https://doi.org/10.1002/jrsm.1710. arXiv:https://onlinelibrary.wiley.c...

work page doi:10.1002/jrsm.1710 2024

[14] [14]

Backes, A

T. Backes, A. Iurshina, M. A. Shahid, P. Mayr, Comparing free reference extraction pipelines, Int. J. Digit. Libr. (2024). doi:10.1007/S00799-024-00404-6

work page doi:10.1007/s00799-024-00404-6 2024

[15] [15]

olmocr: Unlocking trillions of tokens in pdfs with vi- sion language models.arXiv preprint arXiv:2502.18443, 2025a

J. Poznanski, A. Rangapur, J. Borchardt, J. Dunkelberger, R. Huff, D. Lin, A. Rangapur, C. Wilhelm, K. Lo, L. Soldaini, olmocr: Unlocking trillions of tokens in pdfs with vision language models, 2025.arXiv:2502.18443

work page arXiv 2025

[16] [16]

H. Lai, J. Liu, C. Bai, H. Liu, B. Pan, X. Luo, L. Hou, W. Zhao, D. Xia, J. Tian, Y. Chen, L. Zhang, J. Estill, J. Liu, X. Liao, N. Shi, X. Sun, H. Shang, Z. Bian, K. Yang, L. Huang, L. Ge, H. Li, Y. Wang, H. Zhang, D. Zhu, D. Peng, F. Wang, Y. Li, S. Tang, H. Liu, Z. Li, Z. Yang, X. Yu, Y. Qin, Language models for data extraction and risk of bias assessm...

work page doi:10.1038/s41746-025-01457-w 2025

[17] [17]

Z. Li, Y. Yu, W. Gu, T. Zhu, H. Song, W. Guo, X. Yang, Z. Zhu, Dual-llm adversarial framework for information extraction from research literature, bioRxiv (2025). URL: https: //www.biorxiv.org/content/early/2025/09/16/2025.09.11.675507. doi: 10.1101/2025.09.11.675507. arXiv:https://www.biorxiv.org/content/early/2025/09/16/2025.09.11.675507.full.pdf

work page doi:10.1101/2025.09.11.675507 2025

[18] [18]

Barrow, R

J. Barrow, R. Patel, M. Kharkovski, B. Davies, R. Schmitt, Safepassage: High-fidelity information extraction with black box llms, CoRR abs/2510.00276 (2025). URL: https://doi.org/10.48550/arXiv.2510.00276. doi:10.48550/ARXIV. 2510.00276.arXiv:2510.00276

work page doi:10.48550/arxiv.2510.00276 2025