Recognition: no theorem link
TiAb Review Plugin: A Browser-Based Tool for AI-Assisted Title and Abstract Screening
Pith reviewed 2026-05-10 18:02 UTC · model grok-4.3
The pith
A Chrome browser extension enables no-code, serverless AI-assisted screening of titles and abstracts for systematic reviews.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We developed a functional browser extension that integrates large language model screening and machine learning active learning into a no-code, serverless environment, ready for practical use in systematic review screening.
What carries the argument
The TiAb Review Plugin browser extension, which stores API keys locally, uses Google Sheets as a shared database for collaboration, and executes both large language model queries and a re-implemented active learning classifier directly in the browser.
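To make the serverless design concrete, here is a minimal TypeScript sketch, assuming the standard Chrome extension storage API and the Google Sheets values API, of the two storage paths described above. It is not the extension's actual code: the key name, sheet layout, helper names, and OAuth handling are assumptions, and the local encryption step the abstract mentions is omitted.

```typescript
// Minimal sketch of the two storage paths: the API key stays in the browser
// profile, and screening decisions go to a shared Google Sheet (no server).

// Keep the user-supplied Gemini API key locally (encryption omitted here).
async function saveApiKey(key: string): Promise<void> {
  await chrome.storage.local.set({ geminiApiKey: key });
}

async function loadApiKey(): Promise<string | undefined> {
  const { geminiApiKey } = await chrome.storage.local.get("geminiApiKey");
  return geminiApiKey;
}

// Append one screening decision to a shared Google Sheet acting as the
// collaborative database. Sheet id, range, and column layout are assumptions.
async function appendDecision(
  sheetId: string,
  accessToken: string,
  recordId: string,
  decision: "include" | "exclude"
): Promise<void> {
  const url =
    `https://sheets.googleapis.com/v4/spreadsheets/${sheetId}` +
    `/values/Screening!A:C:append?valueInputOption=RAW`;
  await fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${accessToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      values: [[recordId, decision, new Date().toISOString()]],
    }),
  });
}
```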
Load-bearing premise
Benchmark datasets with low prevalence of relevant studies accurately represent how the tool will perform in actual systematic reviews that differ in topic, language, and reviewer definitions of relevance.
What would settle it
A head-to-head test of the extension against full manual screening on a real-world systematic review project, measuring actual recall, precision, and time savings.
Original abstract
Background: Server-based screening tools impose subscription costs, while open-source alternatives require coding skills.
Objectives: We developed a browser extension that provides no-code, serverless artificial intelligence (AI)-assisted title and abstract screening and examined its functionality.
Methods: TiAb Review Plugin is an open-source Chrome browser extension (available at https://chromewebstore.google.com/detail/tiab-review-plugin/alejlnlfflogpnabpbplmnojgoeeabij). It uses Google Sheets as a shared database, requiring no dedicated server and enabling multi-reviewer collaboration. Users supply their own Gemini API key, stored locally and encrypted. The tool offers three screening modes: manual review, large language model (LLM) batch screening, and machine learning (ML) active learning. For ML evaluation, we re-implemented the default ASReview active learning algorithm (TF-IDF with Naive Bayes) in TypeScript to enable in-browser execution, and verified equivalence against the original Python implementation using 10-fold cross-validation on six datasets. For LLM evaluation, we compared 16 parameter configurations across two model families on a benchmark dataset, then validated the optimal configuration (Gemini 3.0 Flash, low thinking budget, TopP=0.95) with a sensitivity-oriented prompt on five public datasets (1,038 to 5,628 records, 0.5 to 2.0 percent prevalence).
Results: The TypeScript classifier produced top-100 rankings 100 percent identical to the original ASReview across all six datasets. For LLM screening, recall was 94 to 100 percent with precision of 2 to 15 percent, and Work Saved over Sampling at 95 percent recall (WSS@95) ranged from 48.7 to 87.3 percent.
Conclusions: We developed a functional browser extension that integrates LLM screening and ML active learning into a no-code, serverless environment, ready for practical use in systematic review screening.
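For readers unfamiliar with the reported metrics, the sketch below computes recall, precision, and Work Saved over Sampling in TypeScript using the standard Cohen et al. (2006) formulation, WSS@R = (TN + FN)/N - (1 - R). The interface and the example counts are illustrative assumptions, not figures from the paper.

```typescript
// Confusion counts for a screening run on one dataset.
interface Confusion {
  tp: number; // relevant records flagged as include
  fp: number; // irrelevant records flagged as include
  tn: number; // irrelevant records flagged as exclude
  fn: number; // relevant records missed (flagged as exclude)
}

const recall = (c: Confusion): number => c.tp / (c.tp + c.fn);
const precision = (c: Confusion): number => c.tp / (c.tp + c.fp);

// Work saved: the share of records a human could skip, minus the share that
// random sampling would already let them skip at the same recall target.
const wss = (c: Confusion, r = 0.95): number => {
  const n = c.tp + c.fp + c.tn + c.fn;
  return (c.tn + c.fn) / n - (1 - r);
};

// Hypothetical 5,000-record dataset at 1% prevalence (50 relevant records).
const example: Confusion = { tp: 48, fp: 600, tn: 4350, fn: 2 };
console.log(recall(example));    // 0.96
console.log(precision(example)); // ~0.074
console.log(wss(example));       // ~0.82
```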
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents TiAb Review Plugin, an open-source Chrome browser extension for no-code, serverless AI-assisted title and abstract screening in systematic reviews. It stores data in Google Sheets for collaboration, lets users supply their own encrypted Gemini API key, and supports three modes: manual review, LLM batch screening, and ML active learning. The ML mode reimplements ASReview's TF-IDF + Naive Bayes classifier in TypeScript; equivalence was verified by 10-fold cross-validation on six datasets yielding 100% identical top-100 rankings. LLM screening (Gemini 3.0 Flash, low thinking budget, TopP=0.95, sensitivity-oriented prompt) was tuned on one benchmark then validated on five public datasets (1k–5.6k records, 0.5–2% prevalence), achieving 94–100% recall and WSS@95 of 48.7–87.3%. The central claim is that this produces a functional, practical tool ready for use in systematic review screening.
Significance. If the reported functionality and benchmark performance hold, the work delivers a genuinely accessible alternative that removes server costs and coding requirements, potentially broadening adoption of AI screening. Credit is due for the careful, reproducible verification of the TypeScript re-implementation against the original Python ASReview (identical rankings across all folds and datasets) and for releasing the extension on the Chrome Web Store with open-source code. These elements strengthen the engineering contribution. The high-recall LLM results on public benchmarks are encouraging, though the manuscript's leap from these controlled, low-prevalence settings to broad practical readiness is the point that requires further substantiation.
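As a rough picture of what runs in the browser, here is a compressed TF-IDF plus multinomial Naive Bayes ranking sketch in TypeScript. It is illustrative only: ASReview's actual tokenisation, weighting, and smoothing settings, which the verified re-implementation reproduces, differ in detail, and every name below is an assumption rather than the plugin's code.

```typescript
type Doc = string[]; // pre-tokenised title + abstract

// TF-IDF vectors: term frequency times log(N / document frequency).
function tfidf(docs: Doc[]): Map<string, number>[] {
  const df = new Map<string, number>();
  for (const d of docs) for (const t of new Set(d)) df.set(t, (df.get(t) ?? 0) + 1);
  const n = docs.length;
  return docs.map((d) => {
    const tf = new Map<string, number>();
    for (const t of d) tf.set(t, (tf.get(t) ?? 0) + 1);
    const vec = new Map<string, number>();
    for (const [t, f] of tf) vec.set(t, f * Math.log(n / (df.get(t) ?? 1)));
    return vec;
  });
}

// Multinomial Naive Bayes with Laplace smoothing; returns the unlabeled
// record indices ranked by log-odds of relevance (higher = screened earlier).
function rankByNaiveBayes(
  vectors: Map<string, number>[],
  labels: (0 | 1 | undefined)[] // 1 = relevant, 0 = irrelevant, undefined = unlabeled
): number[] {
  const vocab = new Set<string>();
  for (const v of vectors) for (const t of v.keys()) vocab.add(t);
  const sums: [Map<string, number>, Map<string, number>] = [new Map(), new Map()];
  const totals: [number, number] = [0, 0];
  const priors: [number, number] = [0, 0];
  vectors.forEach((v, i) => {
    const y = labels[i];
    if (y === undefined) return;
    priors[y] += 1;
    for (const [t, w] of v) {
      sums[y].set(t, (sums[y].get(t) ?? 0) + w);
      totals[y] += w;
    }
  });
  const logPosterior = (v: Map<string, number>, y: 0 | 1): number => {
    let s = Math.log((priors[y] + 1) / (priors[0] + priors[1] + 2));
    for (const [t, w] of v) {
      const p = ((sums[y].get(t) ?? 0) + 1) / (totals[y] + vocab.size);
      s += w * Math.log(p);
    }
    return s;
  };
  return vectors
    .map((v, i) => ({ i, score: logPosterior(v, 1) - logPosterior(v, 0) }))
    .filter(({ i }) => labels[i] === undefined)
    .sort((a, b) => b.score - a.score)
    .map(({ i }) => i);
}
```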
major comments (1)
- [Conclusions] Conclusions (and abstract): The assertion that the tool is 'ready for practical use in systematic review screening' is load-bearing for the central claim yet rests exclusively on benchmark evaluations (0.5–2% prevalence, 1k–5.6k records). No user studies, live workflow integration tests, or evaluations on real systematic reviews that vary in prevalence, topic, language, or inclusion-criteria subjectivity are reported. This gap directly affects whether the no-code/serverless design remains effective outside the tested conditions.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the engineering contributions of the TiAb Review Plugin. We have carefully considered the major comment on the strength of our conclusions and have revised the manuscript accordingly to ensure our claims accurately reflect the scope of the reported evaluations.
Point-by-point responses
- Referee: The assertion that the tool is 'ready for practical use in systematic review screening' is load-bearing for the central claim yet rests exclusively on benchmark evaluations (0.5–2% prevalence, 1k–5.6k records). No user studies, live workflow integration tests, or evaluations on real systematic reviews that vary in prevalence, topic, language, or inclusion-criteria subjectivity are reported. This gap directly affects whether the no-code/serverless design remains effective outside the tested conditions.
Authors: We agree that the original phrasing in the Conclusions and Abstract overstates the readiness for broad practical deployment. Our evaluations are indeed limited to benchmark datasets with low prevalence, and we did not conduct user studies, live workflow tests, or assessments on real systematic reviews with varying topics, languages, or subjective inclusion criteria. In the revised manuscript we will: (1) soften the Conclusions and Abstract to state that the tool is a functional, open-source, no-code and serverless implementation that achieves high recall on standard public benchmarks and is therefore available for practical testing and use; (2) add a dedicated Limitations section that explicitly notes the absence of real-world user studies and the need for future validation across diverse prevalence levels, topics, languages, and screening criteria; and (3) retain the benchmark results as evidence of technical feasibility while framing them as an initial demonstration rather than comprehensive proof of effectiveness in all settings. These changes directly address the referee's concern without requiring new experiments.
revision: yes
Circularity Check
No circularity in derivation or evaluation chain
Full rationale
The paper presents a tool-development and benchmark-evaluation workflow with no load-bearing steps that reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The TypeScript re-implementation of the ASReview algorithm is verified by direct equivalence testing on independent public datasets via 10-fold cross-validation; LLM configuration selection on one dataset followed by validation on five others follows standard hold-out practice without circular reduction. Central claims rest on externally reproducible comparisons to the original Python ASReview and public benchmark datasets rather than internal re-labeling or ansatz smuggling. The evaluation chain is self-contained against the stated benchmarks.
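The equivalence claim that anchors this check can be stated mechanically: for each cross-validation fold, the TypeScript and Python classifiers place the same 100 records at the top of their rankings. A small sketch of such a comparison follows; the field names are assumptions, and ordered identity is assumed here since the paper does not say whether ordered lists or unordered top-100 sets were compared.

```typescript
// True when two rankings agree exactly on their first 100 record ids.
function top100Identical(rankingA: number[], rankingB: number[]): boolean {
  const a = rankingA.slice(0, 100);
  const b = rankingB.slice(0, 100);
  return a.length === b.length && a.every((id, i) => id === b[i]);
}

// The reported result is that this holds for every fold of all six datasets.
function allFoldsIdentical(folds: { ts: number[]; py: number[] }[]): boolean {
  return folds.every((f) => top100Identical(f.ts, f.py));
}
```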
Axiom & Free-Parameter Ledger
free parameters (1)
- LLM configuration (Gemini 3.0 Flash, low thinking budget, TopP=0.95); see the request sketch after this ledger
axioms (1)
- domain assumption: The TypeScript re-implementation of TF-IDF + Naive Bayes produces identical rankings to the original Python ASReview
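For orientation, here is a hedged sketch of what a single screening request with the free-parameter configuration above could look like against the Gemini generateContent REST endpoint. Only TopP = 0.95 is taken from the paper; the model id string, the numeric thinking budget, the prompt wording, and the response handling are illustrative assumptions.

```typescript
// One screening call per record; returns the raw model verdict text.
async function screenRecord(
  apiKey: string,
  title: string,
  abstract: string
): Promise<string> {
  const model = "gemini-3.0-flash"; // assumed endpoint id for "Gemini 3.0 Flash"
  const url = `https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent`;
  const body = {
    contents: [
      {
        role: "user",
        parts: [
          {
            // Sensitivity-oriented framing, per the abstract; exact wording assumed.
            text:
              "You are screening titles and abstracts for a systematic review. " +
              "When in doubt, err on the side of inclusion. Answer INCLUDE or EXCLUDE.\n\n" +
              `Title: ${title}\nAbstract: ${abstract}`,
          },
        ],
      },
    ],
    generationConfig: {
      topP: 0.95,
      thinkingConfig: { thinkingBudget: 128 }, // "low thinking budget"; value assumed
    },
  };
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-goog-api-key": apiKey },
    body: JSON.stringify(body),
  });
  const data = await res.json();
  return data.candidates?.[0]?.content?.parts?.[0]?.text ?? "";
}
```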
Reference graph
Works this paper leans on
- [1] Borah R, Brown AW, Capers PL, Kaiser KA. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017 Feb;7(2):e012545. doi:10.1136/bmjopen-2016-012545
- [2] Gates A, Gates M, Sebastianski M, Guitard S, Elliott SA, Hartling L. The semi-automation of title and abstract screening: A retrospective exploration of ways to leverage abstrackr's relevance predictions in systematic and rapid reviews. BMC Medical Research Methodology. 2020 Jun;20(1). doi:10.1186/s12874-020-01031-w
- [3] Tsafnat G, Glasziou P, Karystianis G, Coiera E. Automated screening of research studies for systematic reviews using study characteristics. Systematic Reviews. 2018 Apr;7(1). doi:10.1186/s13643-018-0724-7
- [4] Mao X, Leelanupab T, Scells H, Zuccon G. DenseReviewer: A screening prioritisation tool for systematic review based on dense retrieval. In: Advances in information retrieval [Internet]. Springer Nature Switzerland; p. 59–64. Available from: http://dx.doi.org/10.1007/978-3-031-88720-8_11 doi:10.1007/978-3-031-88720-8_11
- [6] Huotala A, Kuutila M, Turtio OP, Mäntylä M. AISysRev – LLM-based tool for title-abstract screening. arXiv [cs.SE]. 2025 Oct. doi:10.48550/arXiv.2510.06708
- [7] Harrison H, Griffin SJ, Kuhn I, Usher-Smith JA. Software tools to support title and abstract screening for systematic reviews in healthcare: An evaluation. BMC Medical Research Methodology. 2020 Jan;20(1). doi:10.1186/s12874-020-0897-3
- [8] Schoot R van de, Bruin J de, Schram R, Zahedi P, Boer J de, Weijdema F, et al. An open source machine learning framework for efficient and transparent systematic reviews. Nature Machine Intelligence. 2021 Feb;3(2):125–33. doi:10.1038/s42256-020-00287-7
- [9] ASReview LAB. ASReview LAB — installation documentation [Internet]. Available from: https://asreview.readthedocs.io/en/stable/lab/installation.html
- [10] Systematic Review Toolbox. Systematic review toolbox [Internet]. Available from: https://systematicreviewtools.com/
- [11] Ferdinands G, Schram R, Bruin J de, Schoot R van de. Performance of active learning models for screening prioritization in systematic reviews: A simulation study into the average time to discover relevant records. Systematic Reviews. 2023 Jun;12(1). doi:10.1186/s13643-023-02257-7
- [12] Wallace BC, Small K, Brodley CE, Lau J, Trikalinos TA. Deploying an interactive machine learning system in an evidence-based practice center: abstrackr. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium [Internet]. ACM; 2012. p. 819–24. (IHI '12). Available from: http://dx.doi.org/10.1145/2110363.2110464 doi:10.1145/2110363.2110464
- [13] Kim JK, Rickard M, Dangle P, Batra N, Chua ME, Khondker A, et al. Evaluating large language models for title/abstract screening: A systematic review and meta-analysis & development of new tool. Journal of Medical Artificial Intelligence [Internet]. 2025;8(0). Available from: https://jmai.amegroups.org/article/view/10102
- [14] Guo E, Gupta M, Deng J, Park YJ, Paget M, Naugler C. Automated paper screening for clinical reviews using large language models: Data analysis study. Journal of Medical Internet Research. 2024 Jan;26:e48996. doi:10.2196/48996
- [15] Khraisha Q, Put S, Kappenberg J, Warber A, Ostfeld K. Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. Research Synthesis Methods. 2024. doi:10.1002/jrsm.1715
- [16] Covidence. Covidence — systematic review software [Internet]. Available from: https://www.covidence.org/
- [17] Rayyan. Rayyan — AI-powered tool for systematic literature reviews [Internet]. Available from: https://www.rayyan.ai/
- [18] Evidence Partners. DistillerSR [Internet]. Available from: https://www.distillersr.com/
- [19] Elicit. Elicit — the AI research assistant [Internet]. Available from: https://elicit.com/
- [20] Callaghan MW, Müller-Hansen F. Statistical stopping criteria for automated screening in systematic reviews. Systematic Reviews. 2020 Nov;9(1):273
- [21] Bannach-Brown A, Liao J, Wegener G, Macleod MR. Understanding in vivo models of depression: A systematic review - records of full search. Zenodo; 2016
- [22] Kataoka Y, Taito S, Yamamoto N, So R, Tsutsumi Y, Anan K, et al. An open competition involving thousands of competitors failed to construct useful abstract classifiers for new diagnostic test accuracy systematic reviews. Research Synthesis Methods. 2023 Jun;14(5):707–17. doi:10.1002/jrsm.1649
- [23] Oami T, Okada Y, Nakada T. Performance of a large language model in screening citations. JAMA Network Open. 2024 Jul;7(7):e2420496. doi:10.1001/jamanetworkopen.2024.20496
- [24] Cohen AM, Hersh WR, Peterson K, Yen P-Y. Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association. 2006 Mar;13(2):206–19. doi:10.1197/jamia.m1929
discussion (0)