pith. machine review for the scientific record.

arxiv: 2604.08602 · v1 · submitted 2026-04-08 · 💻 cs.DL · cs.AI · cs.LG

Recognition: no theorem link

TiAb Review Plugin: A Browser-Based Tool for AI-Assisted Title and Abstract Screening

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:02 UTC · model grok-4.3

classification 💻 cs.DL · cs.AI · cs.LG

keywords browser extension · systematic review · title and abstract screening · large language model · active learning · no-code · serverless

The pith

A Chrome browser extension enables no-code, serverless AI-assisted screening of titles and abstracts for systematic reviews.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a browser-based tool that brings AI assistance to the labor-intensive first stage of systematic reviews. It sidesteps both the subscription costs of server-based tools and the programming skills demanded by open-source alternatives: everything runs in the user's browser, with Google Sheets handling shared data and a personal API key supplying the language model. The extension offers manual, large language model batch, and machine learning active learning screening modes. Validation shows the machine learning component exactly matches established software, and the language model screening finds nearly all relevant records while cutting the screening workload substantially on test sets. The design aims to let more research groups use modern AI methods without extra infrastructure.

Core claim

We developed a functional browser extension that integrates large language model screening and machine learning active learning into a no-code, serverless environment, ready for practical use in systematic review screening.

What carries the argument

The TiAb Review Plugin browser extension, which stores API keys locally, uses Google Sheets as a shared database for collaboration, and executes both large language model queries and a re-implemented active learning classifier directly in the browser.

Load-bearing premise

Benchmark datasets with low prevalence of relevant studies accurately represent how the tool will perform in actual systematic reviews that differ in topic, language, and reviewer definitions of relevance.

What would settle it

A head-to-head test of the extension against full manual screening on a real-world systematic review project, measuring actual recall, precision, and time savings.

read the original abstract

Background: Server-based screening tools impose subscription costs, while open-source alternatives require coding skills. Objectives: We developed a browser extension that provides no-code, serverless artificial intelligence (AI)-assisted title and abstract screening and examined its functionality. Methods: TiAb Review Plugin is an open-source Chrome browser extension (available at https://chromewebstore.google.com/detail/tiab-review-plugin/alejlnlfflogpnabpbplmnojgoeeabij). It uses Google Sheets as a shared database, requiring no dedicated server and enabling multi-reviewer collaboration. Users supply their own Gemini API key, stored locally and encrypted. The tool offers three screening modes: manual review, large language model (LLM) batch screening, and machine learning (ML) active learning. For ML evaluation, we re-implemented the default ASReview active learning algorithm (TF-IDF with Naive Bayes) in TypeScript to enable in-browser execution, and verified equivalence against the original Python implementation using 10-fold cross-validation on six datasets. For LLM evaluation, we compared 16 parameter configurations across two model families on a benchmark dataset, then validated the optimal configuration (Gemini 3.0 Flash, low thinking budget, TopP=0.95) with a sensitivity-oriented prompt on five public datasets (1,038 to 5,628 records, 0.5 to 2.0 percent prevalence). Results: The TypeScript classifier produced top-100 rankings 100 percent identical to the original ASReview across all six datasets. For LLM screening, recall was 94 to 100 percent with precision of 2 to 15 percent, and Work Saved over Sampling at 95 percent recall (WSS@95) ranged from 48.7 to 87.3 percent. Conclusions: We developed a functional browser extension that integrates LLM screening and ML active learning into a no-code, serverless environment, ready for practical use in systematic review screening.
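The headline metric in the abstract, Work Saved over Sampling at 95% recall (WSS@95), rewards rankings that surface relevant records early. Following the standard definition from Cohen et al. [24], a minimal sketch of how it can be computed from a ranked screening order (function and variable names are illustrative, not from the paper):

```typescript
// Work Saved over Sampling at recall level r (WSS@r).
// `ranked` holds each record's true relevance label in the order the
// model would have a reviewer screen them (true = relevant).
function wssAtRecall(ranked: boolean[], r: number): number {
  const n = ranked.length;
  const totalRelevant = ranked.filter(Boolean).length;
  const target = Math.ceil(r * totalRelevant); // relevant records needed for recall r
  let found = 0;
  let screened = n; // worst case: every record must be read
  for (let i = 0; i < n; i++) {
    if (ranked[i]) found++;
    if (found >= target) { screened = i + 1; break; }
  }
  // Fraction of records the reviewer is spared, minus the (1 - r)
  // saving that random screening would already achieve by stopping early.
  return (n - screened) / n - (1 - r);
}

// Ten records, two relevant, both ranked near the top:
const ranking = [true, false, true, false, false,
                 false, false, false, false, false];
console.log(wssAtRecall(ranking, 0.95)); // ≈ 0.65
```

On this definition, the reported WSS@95 range of 48.7–87.3% corresponds to screening roughly 46% down to about 8% of the records before reaching 95% recall.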

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents TiAb Review Plugin, an open-source Chrome browser extension for no-code, serverless AI-assisted title and abstract screening in systematic reviews. It stores data in Google Sheets for collaboration, lets users supply their own encrypted Gemini API key, and supports three modes: manual review, LLM batch screening, and ML active learning. The ML mode reimplements ASReview's TF-IDF + Naive Bayes classifier in TypeScript; equivalence was verified by 10-fold cross-validation on six datasets yielding 100% identical top-100 rankings. LLM screening (Gemini 3.0 Flash, low thinking budget, TopP=0.95, sensitivity-oriented prompt) was tuned on one benchmark then validated on five public datasets (1k–5.6k records, 0.5–2% prevalence), achieving 94–100% recall and WSS@95 of 48.7–87.3%. The central claim is that this produces a functional, practical tool ready for use in systematic review screening.
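The validated configuration named in the summary (Gemini 3.0 Flash, low thinking budget, TopP=0.95, sensitivity-oriented prompt) would map onto a generateContent request along these lines. This is a hedged sketch only: the generationConfig/thinkingConfig field names follow the public Gemini API, while the prompt wording and the concrete thinking-budget value are assumptions, not taken from the paper.

```typescript
// Illustrative request body for one title/abstract record. The exact
// prompt text and thinkingBudget value are assumed; topP and the model
// id come from the paper's reported configuration.
const screeningRequest = {
  model: "gemini-3.0-flash",
  contents: [{
    role: "user",
    parts: [{
      text:
        "Decide INCLUDE or EXCLUDE for this record against the review's " +
        "criteria. When uncertain, prefer INCLUDE (sensitivity-oriented).\n" +
        "Title: ...\nAbstract: ...",
    }],
  }],
  generationConfig: {
    topP: 0.95,                              // validated setting from the paper
    thinkingConfig: { thinkingBudget: 128 }, // "low" budget; value assumed
  },
};
```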

Significance. If the reported functionality and benchmark performance hold, the work delivers a genuinely accessible alternative that removes server costs and coding requirements, potentially broadening adoption of AI screening. Credit is due for the careful, reproducible verification of the TypeScript re-implementation against the original Python ASReview (identical rankings across all folds and datasets) and for releasing the extension on the Chrome Web Store with open-source code. These elements strengthen the engineering contribution. The high-recall LLM results on public benchmarks are encouraging, though the manuscript's leap from these controlled, low-prevalence settings to broad practical readiness is the point that requires further substantiation.

major comments (1)
  1. [Conclusions] Conclusions (and abstract): The assertion that the tool is 'ready for practical use in systematic review screening' is load-bearing for the central claim yet rests exclusively on benchmark evaluations (0.5–2% prevalence, 1k–5.6k records). No user studies, live workflow integration tests, or evaluations on real systematic reviews that vary in prevalence, topic, language, or inclusion-criteria subjectivity are reported. This gap directly affects whether the no-code/serverless design remains effective outside the tested conditions.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the engineering contributions of the TiAb Review Plugin. We have carefully considered the major comment on the strength of our conclusions and have revised the manuscript accordingly to ensure our claims accurately reflect the scope of the reported evaluations.

read point-by-point responses
  1. Referee: The assertion that the tool is 'ready for practical use in systematic review screening' is load-bearing for the central claim yet rests exclusively on benchmark evaluations (0.5–2% prevalence, 1k–5.6k records). No user studies, live workflow integration tests, or evaluations on real systematic reviews that vary in prevalence, topic, language, or inclusion-criteria subjectivity are reported. This gap directly affects whether the no-code/serverless design remains effective outside the tested conditions.

    Authors: We agree that the original phrasing in the Conclusions and Abstract overstates the readiness for broad practical deployment. Our evaluations are indeed limited to benchmark datasets with low prevalence, and we did not conduct user studies, live workflow tests, or assessments on real systematic reviews with varying topics, languages, or subjective inclusion criteria. In the revised manuscript we will: (1) soften the Conclusions and Abstract to state that the tool is a functional, open-source, no-code and serverless implementation that achieves high recall on standard public benchmarks and is therefore available for practical testing and use; (2) add a dedicated Limitations section that explicitly notes the absence of real-world user studies and the need for future validation across diverse prevalence levels, topics, languages, and screening criteria; and (3) retain the benchmark results as evidence of technical feasibility while framing them as an initial demonstration rather than comprehensive proof of effectiveness in all settings. These changes directly address the referee's concern without requiring new experiments. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

full rationale

The paper presents a tool-development and benchmark-evaluation workflow with no load-bearing steps that reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The TypeScript re-implementation of the ASReview algorithm is verified by direct equivalence testing on independent public datasets via 10-fold cross-validation; LLM configuration selection on one dataset followed by validation on five others follows standard hold-out practice without circular reduction. Central claims rest on externally reproducible comparisons to the original Python ASReview and public benchmark datasets rather than internal re-labeling or ansatz smuggling. The evaluation chain is self-contained against the stated benchmarks.
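The 10-fold cross-validation invoked here can be sketched as a deterministic fold assignment; the round-robin split below is an assumption for illustration, since the rationale does not state how folds were drawn.

```typescript
// Assign n record indices to k cross-validation folds round-robin.
function kFoldIndices(n: number, k: number): number[][] {
  const folds: number[][] = Array.from({ length: k }, () => []);
  for (let i = 0; i < n; i++) folds[i % k].push(i);
  return folds;
}

// Each fold serves once as the held-out set; the other k - 1 train.
const folds = kFoldIndices(1038, 10); // 1,038 = smallest dataset in the abstract
console.log(folds.map(f => f.length)); // ten folds of 103 or 104 records
```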

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the successful development of the extension and its measured performance. The only notable assumption is equivalence of the browser ML re-implementation, which was directly tested rather than assumed.

free parameters (1)
  • LLM configuration (Gemini 3.0 Flash, low thinking budget, TopP=0.95)
    Selected as optimal after comparing 16 parameter sets on a benchmark dataset
axioms (1)
  • domain assumption The TypeScript re-implementation of TF-IDF + Naive Bayes produces identical rankings to the original Python ASReview
    Invoked when claiming equivalence; supported by 10-fold cross-validation results on six datasets
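The equivalence axiom above was tested by comparing top-100 rankings between the TypeScript and Python implementations. A minimal sketch of such a check over two arrays of relevance scores (the function names and the index-based tie-break are illustrative):

```typescript
// Indices of the k highest-scoring records, ties broken by index.
function topK(scores: number[], k: number): number[] {
  return scores
    .map((s, i) => [s, i] as [number, number])
    .sort((a, b) => b[0] - a[0] || a[1] - b[1])
    .slice(0, k)
    .map(([, i]) => i);
}

// Two implementations count as equivalent on a dataset when their
// top-k rankings are identical, position by position.
function identicalTopK(a: number[], b: number[], k: number): boolean {
  const ta = topK(a, k);
  const tb = topK(b, k);
  return ta.length === tb.length && ta.every((v, i) => v === tb[i]);
}
```

The paper's result, 100% identical top-100 rankings across all six datasets, amounts to this predicate holding with k = 100 everywhere.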

pith-pipeline@v0.9.0 · 5721 in / 1174 out tokens · 63573 ms · 2026-05-10T18:02:30.740185+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 14 canonical work pages · 1 internal anchor

  1. [1]

    Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry

    Borah R, Brown AW, Capers PL, Kaiser KA. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017 Feb;7(2):e012545. doi: 10.1136/bmjopen-2016-012545

  2. [2]

    The semi-automation of title and abstract screening: A retrospective exploration of ways to leverage abstrackr’s relevance predictions in systematic and rapid reviews

    Gates A, Gates M, Sebastianski M, Guitard S, Elliott SA, Hartling L. The semi-automation of title and abstract screening: A retrospective exploration of ways to leverage abstrackr’s relevance predictions in systematic and rapid reviews. BMC Medical Research Methodology. 2020 Jun;20(1). doi: 10.1186/s12874-020-01031-w

  3. [3]

    Automated screening of research studies for systematic reviews using study characteristics

    Tsafnat G, Glasziou P, Karystianis G, Coiera E. Automated screening of research studies for systematic reviews using study characteristics. Systematic Reviews. 2018 Apr;7(1). doi: 10.1186/s13643-018-0724-7

  4. [4]

    DenseReviewer: A screening prioritisation tool for systematic review based on dense retrieval

    Mao X, Leelanupab T, Scells H, Zuccon G. DenseReviewer: A screening prioritisation tool for systematic review based on dense retrieval. In: Advances in information retrieval [Internet]. Springer Nature Switzerland; p. 59–64. Available from: http://dx.doi.org/10.1007/978-3-031-88720-8_11 doi: 10.1007/978-3-031-88720-8_11

  6. [6]

    AISysRev – LLM-based Tool for Title-abstract Screening

    Huotala A, Kuutila M, Turtio OP, Mäntylä M. AISysRev – LLM-based tool for title-abstract screening. arXiv [csSE]. 2025 Oct. doi: 10.48550/arXiv.2510.06708

  7. [7]

    Software tools to support title and abstract screening for systematic reviews in healthcare: An evaluation

    Harrison H, Griffin SJ, Kuhn I, Usher-Smith JA. Software tools to support title and abstract screening for systematic reviews in healthcare: An evaluation. BMC Medical Research Methodology. 2020 Jan;20(1). doi: 10.1186/s12874-020-0897-3

  8. [8]

    An open source machine learning framework for efficient and transparent systematic reviews

    Schoot R van de, Bruin J de, Schram R, Zahedi P, Boer J de, Weijdema F, et al. An open source machine learning framework for efficient and transparent systematic reviews. Nature Machine Intelligence. 2021 Feb;3(2):125–33. doi: 10.1038/s42256-020-00287-7

  9. [9]

    ASReview LAB — installation documentation [Internet]

    ASReview LAB. ASReview LAB — installation documentation [Internet]. Available from: https://asreview.readthedocs.io/en/stable/lab/installation.html

  10. [10]

    Systematic review toolbox [Internet]

    Systematic Review Toolbox. Systematic review toolbox [Internet]. Available from: https://systematicreviewtools.com/

  11. [11]

    Performance of active learning models for screening prioritization in systematic reviews: A simulation study into the average time to discover relevant records

    Ferdinands G, Schram R, Bruin J de, Schoot R van de. Performance of active learning models for screening prioritization in systematic reviews: A simulation study into the average time to discover relevant records. Systematic Reviews. 2023 Jun;12(1). doi: 10.1186/s13643-023-02257-7

  12. [12]

    Deploying an interactive machine learning system in an evidence-based practice center: abstrackr

    Wallace BC, Small K, Brodley CE, Lau J, Trikalinos TA. Deploying an interactive machine learning system in an evidence-based practice center: abstrackr. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium [Internet]. ACM; 2012. p. 819–24. (IHI ’12). Available from: http://dx.doi.org/10.1145/2110363.2110464 doi: 10.1145/2110363.2110464

  13. [13]

    Evaluating large language models for title/abstract screening: A systematic review and meta-analysis & development of new tool

    Kim JK, Rickard M, Dangle P, Batra N, Chua ME, Khondker A, et al. Evaluating large language models for title/abstract screening: A systematic review and meta-analysis & development of new tool. Journal of Medical Artificial Intelligence [Internet]. 2025;8(0). Available from: https://jmai.amegroups.org/article/view/10102

  14. [14]

    Automated paper screening for clinical reviews using large language models: Data analysis study

    Guo E, Gupta M, Deng J, Park YJ, Paget M, Naugler C. Automated paper screening for clinical reviews using large language models: Data analysis study. Journal of Medical Internet Research. 2024 Jan;26:e48996. doi: 10.2196/48996

  15. [15]

    Can large language models replace humans in systematic reviews? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages

    Khraisha Q, Put S, Kappenberg J, Warber A, Ostfeld K. Can large language models replace humans in systematic reviews? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. Research Synthesis Methods. 2024. doi: 10.1002/jrsm.1715

  16. [16]

    Covidence — systematic review software [Internet]

    Covidence. Covidence — systematic review software [Internet]. Available from: https://www.covidence.org/

  17. [17]

    Rayyan — AI-powered tool for systematic literature reviews [Internet]

    Rayyan. Rayyan — AI-powered tool for systematic literature reviews [Internet]. Available from: https://www.rayyan.ai/

  18. [18]

    DistillerSR [Internet]

    Evidence Partners. DistillerSR [Internet]. Available from: https://www.distillersr.com/

  19. [19]

    Elicit — the AI research assistant [Internet]

    Elicit. Elicit — the AI research assistant [Internet]. Available from: https://elicit.com/

  20. [20]

    Statistical stopping criteria for automated screening in systematic reviews

    Callaghan MW, Müller-Hansen F. Statistical stopping criteria for automated screening in systematic reviews. Syst Rev. 2020 Nov;9(1):273

  21. [21]

    Understanding in vivo models of depression: A systematic review - records of full search

    Bannach-Brown A, Liao J, Wegener G, Macleod MR. Understanding in vivo models of depression: A systematic review - records of full search. Zenodo; 2016

  22. [22]

    An open competition involving thousands of competitors failed to construct useful abstract classifiers for new diagnostic test accuracy systematic reviews

    Kataoka Y, Taito S, Yamamoto N, So R, Tsutsumi Y, Anan K, et al. An open competition involving thousands of competitors failed to construct useful abstract classifiers for new diagnostic test accuracy systematic reviews. Research Synthesis Methods. 2023 Jun;14(5):707–17. doi: 10.1002/jrsm.1649

  23. [23]

    Performance of a large language model in screening citations

    Oami T, Okada Y, Nakada T. Performance of a large language model in screening citations. JAMA Network Open. 2024 Jul;7(7):e2420496. doi: 10.1001/jamanetworkopen.2024.20496

  24. [24]

    Reducing workload in systematic review preparation using automated citation classification

    Cohen AM, Hersh WR, Peterson K, Yen P-Y. Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association. 2006 Mar;13(2):206–19. doi: 10.1197/jamia.m1929