A Collection of Systematic Reviews in Computer Science
Pith reviewed 2026-05-15 13:14 UTC · model grok-4.3
The pith
SR4CS releases 1,212 computer science systematic reviews with their original Boolean queries to enable reproducible experiments on automated retrieval and screening.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SR4CS is a corpus of 1,212 computer science systematic reviews that includes the original expert-designed Boolean search queries, 104,316 resolved references, and methodological metadata. The paper also supplies normalized approximations of those queries that operate solely over titles and abstracts, which allows direct comparison of the expert queries with zero-shot LLM-generated Boolean queries, BM25, and dense retrieval under a single evaluation protocol.
What carries the argument
SR4CS corpus of 1,212 reviews paired with original expert Boolean queries and their normalized title-abstract approximations, which together serve as the shared testbed for measuring retrieval and screening performance.
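As a rough illustration of what a query operating solely over titles and abstracts involves, the following Python sketch evaluates a nested Boolean expression against the concatenated title and abstract of each record. The tuple-based query representation, the function names, and the three sample records are hypothetical; the query format actually used in SR4CS is not described here and may differ.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(query, tokens):
    """Evaluate a nested Boolean query against a list of tokens.

    A query is either a string (a single term or a multi-word phrase) or a
    tuple ("AND" | "OR" | "NOT", sub_query, ...). This representation is an
    assumption made for illustration only.
    """
    if isinstance(query, str):
        phrase = tokenize(query)
        n = len(phrase)
        # Phrase match: the phrase tokens must appear contiguously.
        return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))
    op, *args = query
    if op == "AND":
        return all(matches(a, tokens) for a in args)
    if op == "OR":
        return any(matches(a, tokens) for a in args)
    if op == "NOT":
        return not matches(args[0], tokens)
    raise ValueError(f"unknown operator: {op}")


def run_query(query, records):
    """Return the ids of records whose title + abstract satisfy the query."""
    return {
        rid
        for rid, (title, abstract) in records.items()
        if matches(query, tokenize(title + " " + abstract))
    }


# Made-up records: id -> (title, abstract).
records = {
    "p1": ("Test case generation with machine learning", "We survey neural methods."),
    "p2": ("Deep learning for image retrieval", "No software testing involved."),
    "p3": ("A study of code review", "Machine learning is not used."),
}
query = (
    "AND",
    ("OR", "machine learning", "deep learning"),
    ("OR", "test case", "software testing"),
)
hits = run_query(query, records)
```

Record `p2` is retrieved even though its abstract says that no software testing is involved, which shows the kind of false positive that pure term matching produces and that later screening has to remove.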
If this is right
- Researchers can now run controlled comparisons between human-written Boolean queries and LLM-generated queries on the same set of resolved references.
- Standard retrieval methods such as BM25 and dense retrieval can be evaluated for their ability to replace or augment Boolean search in systematic review workflows.
- The normalized title-abstract query versions make it possible to isolate the contribution of query formulation from full-text access.
- Future automation systems can be benchmarked for both recall of relevant papers and reduction in manual screening effort using the provided metadata.
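The comparisons listed above reduce to set-based measures over retrieved and relevant documents. The Python sketch below assumes that each method returns a set of document ids and that a review's resolved references serve as the relevance set. The function name, the `screening_reduction` measure, and the sample ids are illustrative assumptions, not the paper's evaluation protocol.

```python
def evaluate(retrieved, relevant, collection_size):
    """Compute precision, recall, and a simple screening-effort measure.

    screening_reduction is the fraction of the collection that no longer
    has to be screened manually because it was not retrieved. It is an
    illustrative measure chosen for this sketch.
    """
    if collection_size <= 0:
        raise ValueError("collection_size must be positive")
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    screening_reduction = 1.0 - len(retrieved) / collection_size
    return {
        "precision": precision,
        "recall": recall,
        "screening_reduction": screening_reduction,
    }


# Made-up example: 4 of 10 documents retrieved, 2 of 3 relevant ones found.
scores = evaluate(
    retrieved={"a", "b", "c", "d"},
    relevant={"a", "b", "e"},
    collection_size=10,
)
```

Running the same function on the output of an expert query, an LLM-generated query, and a ranked list cut off at a fixed depth would give directly comparable numbers, provided all methods search the same collection.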
Where Pith is reading between the lines
- The collection could serve as a seed for building larger cross-domain benchmarks that combine computer science reviews with biomedical ones to test domain adaptation of retrieval models.
- Releasing the queries in both original and normalized forms may encourage development of query-rewriting techniques that preserve expert intent while improving machine readability.
- The dataset opens the possibility of studying how query complexity correlates with screening workload across different subfields of computer science.
Load-bearing premise
The 1,212 collected reviews are representative of typical computer science systematic reviews and the normalized title-abstract queries still capture the core intent of the original expert Boolean queries.
What would settle it
Running the same baseline retrieval experiments on a fresh independent sample of computer science systematic reviews and obtaining substantially different precision-recall curves or ranking patterns would indicate that the SR4CS collection does not generalize.
Original abstract
Systematic reviews are the standard method for synthesizing scientific evidence, but their creation requires substantial manual effort, particularly during retrieval and screening. While recent work has explored automating these steps, evaluation resources remain largely confined to the biomedical domain, limiting reproducible experimentation in other domains. This paper introduces SR4CS, a large-scale collection of systematic reviews in computer science, designed to support reproducible research on Boolean query generation, retrieval, and screening. The corpus comprises 1,212 systematic reviews with their original expert-designed Boolean search queries, 104,316 resolved references, and structured methodological metadata. For controlled evaluation, the original Boolean queries are additionally provided in a normalized, approximated form operating over titles and abstracts. To illustrate the intended use of the collection, baseline experiments compare the approximated expert Boolean queries with zero-shot LLM-generated Boolean queries, BM25, and dense retrieval under a unified evaluation setting. The results highlight systematic differences in precision, recall, and ranking behavior across retrieval paradigms and expose limitations of naive zero-shot Boolean generation. SR4CS is released under an open license on Zenodo (https://doi.org/10.5281/zenodo.17163932), together with documentation and code (https://github.com/webis-de/scolia26-sr4cs), to enable reproducible evaluation and future research on scaling systematic review automation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SR4CS, a dataset comprising 1,212 systematic reviews in computer science, including their original expert Boolean queries, 104,316 resolved references, structured metadata, and normalized title-abstract query approximations, released openly on Zenodo and GitHub. It also reports baseline experiments comparing the approximated expert queries against zero-shot LLM-generated Boolean queries, BM25, and dense retrieval in a unified setting, highlighting differences in precision, recall, and ranking behavior.
Significance. If the collection methodology is sound and the reviews are representative, SR4CS fills a notable gap by providing the first large-scale, openly available resource for systematic-review automation research outside the biomedical domain. The inclusion of both original and normalized queries, plus reproducible baselines and code, directly supports experimentation on Boolean query generation, retrieval, and screening while enabling verification through the released artifacts.
major comments (1)
- [Abstract] The abstract describes the corpus (size, contents, baselines) but gives no details on collection methodology, deduplication, or quality control. Without these, the representativeness of the 1,212 reviews and the fidelity of the normalized queries cannot be assessed, and both are load-bearing for the claim that SR4CS supports reproducible research.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of SR4CS. The single major comment highlights a valid point about the abstract's conciseness. We have revised the abstract to incorporate a brief description of the collection methodology, deduplication, and quality control, while preserving its length and focus. The full manuscript already details these aspects in Section 3.
Point-by-point responses
- Referee: [Abstract] The abstract describes the corpus (size, contents, baselines) but gives no details on collection methodology, deduplication, or quality control. Without these, the representativeness of the 1,212 reviews and the fidelity of the normalized queries cannot be assessed, and both are load-bearing for the claim that SR4CS supports reproducible research.
  Authors: We agree that the abstract should briefly address collection methodology to support claims of representativeness and reproducibility. The full manuscript (Section 3) describes sourcing reviews from ACM, IEEE, and other databases via targeted searches, followed by deduplication using title+author+year matching and manual verification, plus quality control through expert review of a random sample for query fidelity and reference completeness. To address the comment, we have updated the abstract with the following addition: 'Reviews were collected via systematic searches across major CS databases, deduplicated using metadata matching, and validated for query and reference quality.' This makes the abstract self-contained without altering its structure or length.
  Revision: yes
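The rebuttal names title+author+year matching but does not spell out how the fields are normalized. The Python sketch below shows one plausible way to build such a matching key. The normalization rules, the record fields, and the sample records are assumptions made for illustration and are not taken from the paper.

```python
import re
import unicodedata


def normalize(text):
    """Strip accents, lowercase, and collapse non-alphanumerics to spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_like = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", ascii_like.lower()).strip()


def dedup_key(title, first_author, year):
    """Build a (title, family name, year) key; the rules are hypothetical."""
    family_name = first_author.split(",")[0]
    return (normalize(title), normalize(family_name), int(year))


def deduplicate(records):
    """Keep the first record seen for each key, preserving input order."""
    seen = {}
    for rec in records:
        key = dedup_key(rec["title"], rec["author"], rec["year"])
        seen.setdefault(key, rec)
    return list(seen.values())


# Made-up records; the first two differ only in casing, accents, and punctuation.
records = [
    {"title": "A Survey of Fuzzing.", "author": "Müller, A.", "year": "2019"},
    {"title": "a survey of fuzzing", "author": "Muller", "year": 2019},
    {"title": "A Survey of Fuzzing", "author": "Muller", "year": 2021},
]
unique = deduplicate(records)
```

Exact-key matching of this kind misses near-duplicates with retitled or reordered metadata, which is presumably why the rebuttal pairs it with manual verification.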
Circularity Check
Dataset release with no derivations or predictions
The paper is a dataset release paper whose central claim is the introduction of SR4CS (1,212 reviews, original Boolean queries, 104,316 references, normalized title-abstract versions) as an open resource. No equations, fitted parameters, or predictions are present that reduce to prior quantities by construction. Baselines are presented only as illustrations of intended use under a unified evaluation setting. All content is externally grounded in the released Zenodo/GitHub artifacts, which directly enable verification of contents and utility. No self-citation chains, ansatzes, or uniqueness theorems are invoked to support the release claim itself.