pith. machine review for the scientific record.

arxiv: 2604.08602 · v1 · submitted 2026-04-08 · 💻 cs.DL · cs.AI · cs.LG

Recognition: no theorem link

TiAb Review Plugin: A Browser-Based Tool for AI-Assisted Title and Abstract Screening

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:02 UTC · model grok-4.3

classification 💻 cs.DL · cs.AI · cs.LG

keywords browser extension · systematic review · title and abstract screening · large language model · active learning · no-code · serverless

The pith

A Chrome browser extension enables no-code, serverless AI-assisted screening of titles and abstracts for systematic reviews.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a browser-based tool that brings AI assistance to the labor-intensive first stage of systematic reviews. It sidesteps both the subscription costs of server-based tools and the programming skills demanded by open-source alternatives: everything runs in the user's browser, with Google Sheets handling shared data and a personal API key supplying the language model. The extension offers manual, large language model batch, and machine learning active learning screening modes. Validation shows the machine learning component exactly matches established software, and the language model screening finds nearly all relevant records while cutting the screening workload substantially on test sets. The design aims to let more research groups use modern AI methods without extra infrastructure.

Core claim

We developed a functional browser extension that integrates large language model screening and machine learning active learning into a no-code, serverless environment, ready for practical use in systematic review screening.

What carries the argument

The TiAb Review Plugin browser extension, which stores API keys locally, uses Google Sheets as a shared database for collaboration, and executes both large language model queries and a re-implemented active learning classifier directly in the browser.

Load-bearing premise

Benchmark datasets with low prevalence of relevant studies accurately represent how the tool will perform in actual systematic reviews that differ in topic, language, and reviewer definitions of relevance.

What would settle it

A head-to-head test of the extension against full manual screening on a real-world systematic review project, measuring actual recall, precision, and time savings.

read the original abstract

Background: Server-based screening tools impose subscription costs, while open-source alternatives require coding skills. Objectives: We developed a browser extension that provides no-code, serverless artificial intelligence (AI)-assisted title and abstract screening and examined its functionality. Methods: TiAb Review Plugin is an open-source Chrome browser extension (available at https://chromewebstore.google.com/detail/tiab-review-plugin/alejlnlfflogpnabpbplmnojgoeeabij). It uses Google Sheets as a shared database, requiring no dedicated server and enabling multi-reviewer collaboration. Users supply their own Gemini API key, stored locally and encrypted. The tool offers three screening modes: manual review, large language model (LLM) batch screening, and machine learning (ML) active learning. For ML evaluation, we re-implemented the default ASReview active learning algorithm (TF-IDF with Naive Bayes) in TypeScript to enable in-browser execution, and verified equivalence against the original Python implementation using 10-fold cross-validation on six datasets. For LLM evaluation, we compared 16 parameter configurations across two model families on a benchmark dataset, then validated the optimal configuration (Gemini 3.0 Flash, low thinking budget, TopP=0.95) with a sensitivity-oriented prompt on five public datasets (1,038 to 5,628 records, 0.5 to 2.0 percent prevalence). Results: The TypeScript classifier produced top-100 rankings 100 percent identical to the original ASReview across all six datasets. For LLM screening, recall was 94 to 100 percent with precision of 2 to 15 percent, and Work Saved over Sampling at 95 percent recall (WSS@95) ranged from 48.7 to 87.3 percent. Conclusions: We developed a functional browser extension that integrates LLM screening and ML active learning into a no-code, serverless environment, ready for practical use in systematic review screening.
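The headline metric in the abstract, Work Saved over Sampling at 95% recall (WSS@95), rewards rankings that surface relevant records early. Following the standard definition from Cohen et al. [24], a minimal sketch of how it can be computed from a ranked screening order (function and variable names are illustrative, not from the paper):

```typescript
// Work Saved over Sampling at recall level r (WSS@r).
// `ranked` holds each record's true relevance label in the order the
// model would have a reviewer screen them (true = relevant).
function wssAtRecall(ranked: boolean[], r: number): number {
  const n = ranked.length;
  const totalRelevant = ranked.filter(Boolean).length;
  const target = Math.ceil(r * totalRelevant); // relevant records needed for recall r
  let found = 0;
  let screened = n; // worst case: every record must be read
  for (let i = 0; i < n; i++) {
    if (ranked[i]) found++;
    if (found >= target) { screened = i + 1; break; }
  }
  // Fraction of records the reviewer is spared, minus the (1 - r)
  // saving that random screening would already achieve by stopping early.
  return (n - screened) / n - (1 - r);
}

// Ten records, two relevant, both ranked near the top:
const ranking = [true, false, true, false, false,
                 false, false, false, false, false];
console.log(wssAtRecall(ranking, 0.95)); // ≈ 0.65
```

On this definition, the reported WSS@95 range of 48.7–87.3% corresponds to screening roughly 46% down to about 8% of the records before reaching 95% recall.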

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents TiAb Review Plugin, an open-source Chrome browser extension for no-code, serverless AI-assisted title and abstract screening in systematic reviews. It stores data in Google Sheets for collaboration, lets users supply their own encrypted Gemini API key, and supports three modes: manual review, LLM batch screening, and ML active learning. The ML mode reimplements ASReview's TF-IDF + Naive Bayes classifier in TypeScript; equivalence was verified by 10-fold cross-validation on six datasets yielding 100% identical top-100 rankings. LLM screening (Gemini 3.0 Flash, low thinking budget, TopP=0.95, sensitivity-oriented prompt) was tuned on one benchmark then validated on five public datasets (1k–5.6k records, 0.5–2% prevalence), achieving 94–100% recall and WSS@95 of 48.7–87.3%. The central claim is that this produces a functional, practical tool ready for use in systematic review screening.
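The validated configuration named in the summary (Gemini 3.0 Flash, low thinking budget, TopP=0.95, sensitivity-oriented prompt) would map onto a generateContent request along these lines. This is a hedged sketch only: the generationConfig/thinkingConfig field names follow the public Gemini API, while the prompt wording and the concrete thinking-budget value are assumptions, not taken from the paper.

```typescript
// Illustrative request body for one title/abstract record. The exact
// prompt text and thinkingBudget value are assumed; topP and the model
// id come from the paper's reported configuration.
const screeningRequest = {
  model: "gemini-3.0-flash",
  contents: [{
    role: "user",
    parts: [{
      text:
        "Decide INCLUDE or EXCLUDE for this record against the review's " +
        "criteria. When uncertain, prefer INCLUDE (sensitivity-oriented).\n" +
        "Title: ...\nAbstract: ...",
    }],
  }],
  generationConfig: {
    topP: 0.95,                              // validated setting from the paper
    thinkingConfig: { thinkingBudget: 128 }, // "low" budget; value assumed
  },
};
```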

Significance. If the reported functionality and benchmark performance hold, the work delivers a genuinely accessible alternative that removes server costs and coding requirements, potentially broadening adoption of AI screening. Credit is due for the careful, reproducible verification of the TypeScript re-implementation against the original Python ASReview (identical rankings across all folds and datasets) and for releasing the extension on the Chrome Web Store with open-source code. These elements strengthen the engineering contribution. The high-recall LLM results on public benchmarks are encouraging, though the manuscript's leap from these controlled, low-prevalence settings to broad practical readiness is the point that requires further substantiation.

major comments (1)
  1. [Conclusions] Conclusions (and abstract): The assertion that the tool is 'ready for practical use in systematic review screening' is load-bearing for the central claim yet rests exclusively on benchmark evaluations (0.5–2% prevalence, 1k–5.6k records). No user studies, live workflow integration tests, or evaluations on real systematic reviews that vary in prevalence, topic, language, or inclusion-criteria subjectivity are reported. This gap directly affects whether the no-code/serverless design remains effective outside the tested conditions.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the engineering contributions of the TiAb Review Plugin. We have carefully considered the major comment on the strength of our conclusions and have revised the manuscript accordingly to ensure our claims accurately reflect the scope of the reported evaluations.

read point-by-point responses
  1. Referee: The assertion that the tool is 'ready for practical use in systematic review screening' is load-bearing for the central claim yet rests exclusively on benchmark evaluations (0.5–2% prevalence, 1k–5.6k records). No user studies, live workflow integration tests, or evaluations on real systematic reviews that vary in prevalence, topic, language, or inclusion-criteria subjectivity are reported. This gap directly affects whether the no-code/serverless design remains effective outside the tested conditions.

    Authors: We agree that the original phrasing in the Conclusions and Abstract overstates the readiness for broad practical deployment. Our evaluations are indeed limited to benchmark datasets with low prevalence, and we did not conduct user studies, live workflow tests, or assessments on real systematic reviews with varying topics, languages, or subjective inclusion criteria. In the revised manuscript we will: (1) soften the Conclusions and Abstract to state that the tool is a functional, open-source, no-code and serverless implementation that achieves high recall on standard public benchmarks and is therefore available for practical testing and use; (2) add a dedicated Limitations section that explicitly notes the absence of real-world user studies and the need for future validation across diverse prevalence levels, topics, languages, and screening criteria; and (3) retain the benchmark results as evidence of technical feasibility while framing them as an initial demonstration rather than comprehensive proof of effectiveness in all settings. These changes directly address the referee's concern without requiring new experiments. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

full rationale

The paper presents a tool-development and benchmark-evaluation workflow with no load-bearing steps that reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The TypeScript re-implementation of the ASReview algorithm is verified by direct equivalence testing on independent public datasets via 10-fold cross-validation; LLM configuration selection on one dataset followed by validation on five others follows standard hold-out practice without circular reduction. Central claims rest on externally reproducible comparisons to the original Python ASReview and public benchmark datasets rather than internal re-labeling or ansatz smuggling. The evaluation chain is self-contained against the stated benchmarks.
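The 10-fold cross-validation invoked here can be sketched as a deterministic fold assignment; the round-robin split below is an assumption for illustration, since the rationale does not state how folds were drawn.

```typescript
// Assign n record indices to k cross-validation folds round-robin.
function kFoldIndices(n: number, k: number): number[][] {
  const folds: number[][] = Array.from({ length: k }, () => []);
  for (let i = 0; i < n; i++) folds[i % k].push(i);
  return folds;
}

// Each fold serves once as the held-out set; the other k - 1 train.
const folds = kFoldIndices(1038, 10); // 1,038 = smallest dataset in the abstract
console.log(folds.map(f => f.length)); // ten folds of 103 or 104 records
```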

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the successful development of the extension and its measured performance. The only notable assumption is equivalence of the browser ML re-implementation, which was directly tested rather than assumed.

free parameters (1)
  • LLM configuration (Gemini 3.0 Flash, low thinking budget, TopP=0.95)
    Selected as optimal after comparing 16 parameter sets on a benchmark dataset
axioms (1)
  • domain assumption The TypeScript re-implementation of TF-IDF + Naive Bayes produces identical rankings to the original Python ASReview
    Invoked when claiming equivalence; supported by 10-fold cross-validation results on six datasets
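The equivalence axiom above was tested by comparing top-100 rankings between the TypeScript and Python implementations. A minimal sketch of such a check over two arrays of relevance scores (the function names and the index-based tie-break are illustrative):

```typescript
// Indices of the k highest-scoring records, ties broken by index.
function topK(scores: number[], k: number): number[] {
  return scores
    .map((s, i) => [s, i] as [number, number])
    .sort((a, b) => b[0] - a[0] || a[1] - b[1])
    .slice(0, k)
    .map(([, i]) => i);
}

// Two implementations count as equivalent on a dataset when their
// top-k rankings are identical, position by position.
function identicalTopK(a: number[], b: number[], k: number): boolean {
  const ta = topK(a, k);
  const tb = topK(b, k);
  return ta.length === tb.length && ta.every((v, i) => v === tb[i]);
}
```

The paper's result, 100% identical top-100 rankings across all six datasets, amounts to this predicate holding with k = 100 everywhere.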

pith-pipeline@v0.9.0 · 5721 in / 1174 out tokens · 63573 ms · 2026-05-10T18:02:30.740185+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 14 canonical work pages · 1 internal anchor

  1. [1]

    Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry

    Borah R, Brown AW, Capers PL, Kaiser KA. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017 Feb;7(2):e012545. doi: 10.1136/bmjopen-2016-012545

  2. [2]

    The semi-automation of title and abstract screening: A retrospective exploration of ways to leverage abstrackr’s relevance predictions in systematic and rapid reviews

    Gates A, Gates M, Sebastianski M, Guitard S, Elliott SA, Hartling L. The semi-automation of title and abstract screening: A retrospective exploration of ways to leverage abstrackr’s relevance predictions in systematic and rapid reviews. BMC Medical Research Methodology. 2020 Jun;20(1). doi: 10.1186/s12874-020-01031-w

  3. [3]

    Automated screening of research studies for systematic reviews using study characteristics

    Tsafnat G, Glasziou P, Karystianis G, Coiera E. Automated screening of research studies for systematic reviews using study characteristics. Systematic Reviews. 2018 Apr;7(1). doi: 10.1186/s13643-018-0724-7

  4. [4]

    DenseReviewer: A screening prioritisation tool for systematic review based on dense retrieval

    Mao X, Leelanupab T, Scells H, Zuccon G. DenseReviewer: A screening prioritisation tool for systematic review based on dense retrieval. In: Advances in information retrieval [Internet]. Springer Nature Switzerland; p. 59–64. Available from: http://dx.doi.org/10.1007/978-3-031-88720-8_11 doi: 10.1007/978-3-031-88720-8_11

  6. [6]

    AISysRev – LLM-based Tool for Title-abstract Screening

    Huotala A, Kuutila M, Turtio OP, Mäntylä M. AISysRev – LLM-based tool for title-abstract screening. arXiv [csSE]. 2025 Oct. doi: 10.48550/arXiv.2510.06708

  7. [7]

    Software tools to support title and abstract screening for systematic reviews in healthcare: An evaluation

    Harrison H, Griffin SJ, Kuhn I, Usher-Smith JA. Software tools to support title and abstract screening for systematic reviews in healthcare: An evaluation. BMC Medical Research Methodology. 2020 Jan;20(1). doi: 10.1186/s12874-020-0897-3

  8. [8]

    An open source machine learning framework for efficient and transparent systematic reviews

    Schoot R van de, Bruin J de, Schram R, Zahedi P, Boer J de, Weijdema F, et al. An open source machine learning framework for efficient and transparent systematic reviews. Nature Machine Intelligence. 2021 Feb;3(2):125–33. doi: 10.1038/s42256-020-00287-7

  9. [9]

    ASReview LAB — installation documentation [Internet]

    ASReview LAB. ASReview LAB — installation documentation [Internet]. Available from: https://asreview.readthedocs.io/en/stable/lab/installation.html

  10. [10]

    Systematic review toolbox [Internet]

    Systematic Review Toolbox. Systematic review toolbox [Internet]. Available from: https://systematicreviewtools.com/

  11. [11]

    Performance of active learning models for screening prioritization in systematic reviews: A simulation study into the average time to discover relevant records

    Ferdinands G, Schram R, Bruin J de, Schoot R van de. Performance of active learning models for screening prioritization in systematic reviews: A simulation study into the average time to discover relevant records. Systematic Reviews. 2023 Jun;12(1). doi: 10.1186/s13643-023-02257-7

  12. [12]

    Deploying an interactive machine learning system in an evidence-based practice center: abstrackr

    Wallace BC, Small K, Brodley CE, Lau J, Trikalinos TA. Deploying an interactive machine learning system in an evidence-based practice center: abstrackr. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium [Internet]. ACM; 2012. p. 819–24. (IHI ’12). Available from: http://dx.doi.org/10.1145/2110363.2110464 doi: 10.1145/2110363.2110464

  13. [13]

    Evaluating large language models for title/abstract screening: A systematic review and meta-analysis & development of new tool

    Kim JK, Rickard M, Dangle P, Batra N, Chua ME, Khondker A, et al. Evaluating large language models for title/abstract screening: A systematic review and meta-analysis & development of new tool. Journal of Medical Artificial Intelligence [Internet]. 2025;8(0). Available from: https://jmai.amegroups.org/article/view/10102

  14. [14]

    Automated paper screening for clinical reviews using large language models: Data analysis study

    Guo E, Gupta M, Deng J, Park YJ, Paget M, Naugler C. Automated paper screening for clinical reviews using large language models: Data analysis study. Journal of Medical Internet Research. 2024 Jan;26:e48996. doi: 10.2196/48996

  15. [15]

    Can large language models replace humans in systematic reviews? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages

    Khraisha Q, Put S, Kappenberg J, Warber A, Ostfeld K. Can large language models replace humans in systematic reviews? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. Research Synthesis Methods. 2024. doi: 10.1002/jrsm.1715

  16. [16]

    Covidence — systematic review software [Internet]

    Covidence. Covidence — systematic review software [Internet]. Available from: https://www.covidence.org/

  17. [17]

    Rayyan — AI-powered tool for systematic literature reviews [Internet]

    Rayyan. Rayyan — AI-powered tool for systematic literature reviews [Internet]. Available from: https://www.rayyan.ai/

  18. [18]

    DistillerSR [Internet]

    Evidence Partners. DistillerSR [Internet]. Available from: https://www.distillersr.com/

  19. [19]

    Elicit — the AI research assistant [Internet]

    Elicit. Elicit — the AI research assistant [Internet]. Available from: https://elicit.com/

  20. [20]

    Statistical stopping criteria for automated screening in systematic reviews

    Callaghan MW, Müller-Hansen F. Statistical stopping criteria for automated screening in systematic reviews. Syst Rev. 2020 Nov;9(1):273

  21. [21]

    Understanding in vivo models of depression: A systematic review - records of full search

    Bannach-Brown A, Liao J, Wegener G, Macleod MR. Understanding in vivo models of depression: A systematic review - records of full search. Zenodo; 2016

  22. [22]

    An open competition involving thousands of competitors failed to construct useful abstract classifiers for new diagnostic test accuracy systematic reviews

    Kataoka Y, Taito S, Yamamoto N, So R, Tsutsumi Y, Anan K, et al. An open competition involving thousands of competitors failed to construct useful abstract classifiers for new diagnostic test accuracy systematic reviews. Research Synthesis Methods. 2023 Jun;14(5):707–17. doi: 10.1002/jrsm.1649

  23. [23]

    Performance of a large language model in screening citations

    Oami T, Okada Y, Nakada T. Performance of a large language model in screening citations. JAMA Network Open. 2024 Jul;7(7):e2420496. doi: 10.1001/jamanetworkopen.2024.20496

  24. [24]

    Reducing workload in systematic review preparation using automated citation classification

    Cohen AM, Hersh WR, Peterson K, Yen P-Y. Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association. 2006 Mar;13(2):206–19. doi: 10.1197/jamia.m1929