pith. machine review for the scientific record.

arxiv: 2604.14165 · v2 · submitted 2026-03-23 · 💻 cs.CL

Recognition: no theorem link

EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:03 UTC · model grok-4.3

classification 💻 cs.CL
keywords: clinical evidence extraction · systematic reviews · multi-agent systems · PDF document processing · human-in-the-loop AI · oncology trials · provenance tracking · evidence synthesis

The pith

EviSearch uses multi-agent extraction from PDFs to create auditable clinical evidence tables.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

EviSearch presents a multi-agent system designed to automatically generate clinical evidence tables from trial PDFs while ensuring each piece of data can be traced back to its source. The system combines agents that query the PDF layout, search for relevant information, and reconcile disagreements by checking original pages. This approach is tested on oncology trial papers, where it achieves higher accuracy than standard text parsing methods and offers complete provenance for every extracted cell. The design supports human oversight by allowing clinicians to verify and edit outputs, which in turn generates data to refine the system further. Overall, it aims to reduce the manual work involved in creating systematic reviews of medical evidence.

Core claim

The central claim is that pairing a PDF-query agent that preserves layout and figures with a retrieval-guided search agent and a reconciliation module for page-level verification enables high-precision extraction of ontology-aligned evidence tables from native PDFs, with full attribution coverage that supports audit and iterative improvement.

What carries the argument

A multi-agent pipeline consisting of a PDF-query agent, a retrieval-guided search agent, and a reconciliation module that enforces verification on agent disagreements.
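To make the disagreement-triggered verification concrete, here is a minimal control-flow sketch. Everything below is illustrative: the paper does not publish this code, and the `Extraction` schema and `verify_on_page` callback are hypothetical names, not EviSearch's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Extraction:
    """One candidate cell from one agent (schema is illustrative)."""
    field: str            # e.g. "median_os_months"
    value: Optional[str]  # extracted value, None if the agent found nothing
    page: Optional[int]   # page the agent attributes the value to

def reconcile(pdf_agent: Extraction,
              search_agent: Extraction,
              verify_on_page: Callable[[Extraction], bool]) -> Extraction:
    """Accept agreement outright; on disagreement, force page-level
    verification and keep whichever value the cited page supports."""
    if pdf_agent.value == search_agent.value:
        return pdf_agent  # agents agree: no verification step is triggered
    for candidate in (pdf_agent, search_agent):
        if candidate.value is not None and verify_on_page(candidate):
            return candidate
    # Neither value survives verification: emit an empty, flagged cell
    # for human review rather than guessing.
    return Extraction(field=pdf_agent.field, value=None, page=None)
```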

If this is right

  • Substantially improves extraction accuracy relative to strong parsed-text baselines on oncology trial papers.
  • Provides comprehensive attribution coverage for every extracted cell.
  • Produces structured preference and supervision signals from reconciler decisions and reviewer edits for model improvement (a record-schema sketch follows this list).
  • Accelerates living systematic review workflows by reducing manual curation burden.
  • Offers a safe, auditable path for integrating LLM-based extraction into evidence synthesis.
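The paper states that reconciler decisions and reviewer edits are logged as structured preference and supervision signals. A hedged sketch of what one such log record could look like, assuming a JSONL sink; the schema and field names are invented for illustration:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class PreferenceRecord:
    """One preference pair harvested from a reconciler decision or a
    reviewer edit. The schema is invented for illustration."""
    doc_id: str
    field: str
    chosen: str    # value that survived verification or review
    rejected: str  # value that was overruled
    source: str    # "reconciler" or "reviewer_edit"
    page: int      # page that grounded the chosen value

def log_preference(rec: PreferenceRecord, path: str = "prefs.jsonl") -> None:
    # Append one JSON line per decision for downstream training.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(rec)) + "\n")

log_preference(PreferenceRecord(
    doc_id="trial_0042", field="hazard_ratio",
    chosen="0.71", rejected="0.17", source="reviewer_edit", page=6))
```

A DPO-style preference-optimization step could consume such a file, though the paper does not specify its training recipe.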

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The reconciliation mechanisms may generalize to other clinical document types if the benchmark is expanded.
  • Provenance logging could enable training of specialized agents for evidence synthesis beyond the current oncology focus.
  • Integration into broader evidence-based medicine pipelines might lower error rates in meta-analyses.

Load-bearing premise

The multi-agent reconciliation and provenance mechanisms maintain high precision and generalizability beyond the oncology trial papers in the benchmark.

What would settle it

Testing the system on a diverse set of non-oncology clinical papers and observing whether accuracy falls below that of parsed-text baselines or attribution coverage becomes incomplete.

Figures

Figures reproduced from arXiv: 2604.14165 by Irbaz Bin Riaz, Kaneez Zahra Rubab Khakwani, Mohamad Bassam Sonbol, Muhammad Ali Khan, Naman Ahuja, Saniya Mulla, Vivek Gupta, Zaryab Bin Riaz.

Figure 1
Figure 1. EviSearch system architecture.
Figure 2
Figure 2. Extraction Interface.
Figure 3
Figure 3. Attribution Interface.
Figure 4
Figure 4. Overall extraction performance across evidence modalities.
read the original abstract

We present EviSearch, a multi-agent extraction system that automates the creation of ontology-aligned clinical evidence tables directly from native trial PDFs while guaranteeing per-cell provenance for audit and human verification. EviSearch pairs a PDF-query agent (which preserves rendered layout and figures) with a retrieval-guided search agent and a reconciliation module that forces page-level verification when agents disagree. The pipeline is designed for high-precision extraction across multimodal evidence sources (text, tables, figures) and for generating reviewer-actionable provenance that clinicians can inspect and correct. On a clinician-curated benchmark of oncology trial papers, EviSearch substantially improves extraction accuracy relative to strong parsed-text baselines while providing comprehensive attribution coverage. By logging reconciler decisions and reviewer edits, the system produces structured preference and supervision signals that bootstrap iterative model improvement. EviSearch is intended to accelerate living systematic review workflows, reduce manual curation burden, and provide a safe, auditable path for integrating LLM-based extraction into evidence synthesis pipelines.
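The abstract's "comprehensive attribution coverage" reads naturally as a per-cell metric: the share of extracted cells whose provenance anchor actually resolves to a page region. A minimal sketch under that reading; the `page`/`bbox` fields are assumptions, not the paper's schema:

```python
def attribution_coverage(cells: list[dict]) -> float:
    """Fraction of non-empty extracted cells whose provenance anchor
    resolves (a page number plus a region on that page). The field
    names are assumptions, not the paper's schema."""
    extracted = [c for c in cells if c.get("value") is not None]
    if not extracted:
        return 0.0
    attributed = [c for c in extracted
                  if c.get("page") is not None and c.get("bbox") is not None]
    return len(attributed) / len(extracted)

cells = [
    {"value": "0.71", "page": 6, "bbox": (72, 410, 180, 428)},
    {"value": "12.4", "page": 4, "bbox": None},  # anchor failed to resolve
]
print(attribution_coverage(cells))  # 0.5
```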

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EviSearch, a multi-agent LLM system that extracts ontology-aligned clinical evidence tables directly from native trial PDFs. It combines a PDF-query agent preserving layout and figures, a retrieval-guided search agent, and a page-level reconciliation module that triggers verification on agent disagreement. The system logs provenance for each cell to support auditing and human correction. On a clinician-curated benchmark of oncology trial papers, the authors claim substantial accuracy gains over strong parsed-text baselines together with comprehensive attribution coverage. The pipeline is positioned to accelerate living systematic reviews and generate supervision signals for iterative model improvement.

Significance. If the reported accuracy gains and provenance guarantees hold under broader testing, the work could reduce manual curation load in evidence synthesis while maintaining auditability required for clinical use. The explicit logging of reconciler decisions and reviewer edits is a constructive mechanism for producing preference data. The emphasis on multimodal sources (text, tables, figures) and per-cell attribution addresses a recognized pain point in systematic review pipelines.

major comments (2)
  1. [Abstract] The central claim that EviSearch 'substantially improves extraction accuracy' is presented without quantitative metrics, baseline-construction details, error analysis, or statistical significance tests, so the magnitude and reliability of the improvement cannot be evaluated from the provided text.
  2. [Evaluation] The benchmark is restricted to oncology trial papers; no ablations or results are shown for other specialties, observational studies, or non-trial document formats. Because the reconciliation module is presented as domain-agnostic, the lack of cross-domain testing leaves the generalizability of the provenance guarantees unverified.
minor comments (2)
  1. [Abstract] The abstract refers to an 'ontology-aligned' output but does not name the specific ontology or alignment procedure used.
  2. [Figures/Tables] Figure and table captions should explicitly state whether they report precision, recall, or F1 on the clinician-curated benchmark.
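For reference, the metrics the second minor comment asks captions to name have standard cell-level definitions. A sketch that scores exact matches on (field, value) pairs; the matching rule is an assumption, since the paper may normalize values before scoring:

```python
def cell_prf(gold: set, pred: set) -> tuple[float, float, float]:
    """Precision/recall/F1 over sets of (field, value) cell pairs.
    Exact-match scoring is an assumption; the paper may normalize
    values before comparing."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("median_os", "22.1"), ("hazard_ratio", "0.71")}
pred = {("median_os", "22.1"), ("hazard_ratio", "0.17")}
print(cell_prf(gold, pred))  # (0.5, 0.5, 0.5)
```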

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the corresponding revisions made to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] The central claim that EviSearch 'substantially improves extraction accuracy' is presented without quantitative metrics, baseline-construction details, error analysis, or statistical significance tests, so the magnitude and reliability of the improvement cannot be evaluated from the provided text.

    Authors: We agree that the abstract would be strengthened by including quantitative support for the central claim. In the revised manuscript we have updated the abstract to report the key performance metrics (including F1-score improvements over the parsed-text baselines), a brief description of the baseline construction, and the results of statistical significance testing (one standard such test is sketched after these responses). These additions are drawn directly from the evaluation section and make the magnitude of the reported gains evaluable from the abstract alone. revision: yes

  2. Referee: [Evaluation] The benchmark is restricted to oncology trial papers; no ablations or results are shown for other specialties, observational studies, or non-trial document formats. Because the reconciliation module is presented as domain-agnostic, the lack of cross-domain testing leaves the generalizability of the provenance guarantees unverified.

    Authors: The referee correctly observes that the current benchmark is limited to oncology trial papers. While this domain was selected because of the availability of high-quality clinician-curated data and its immediate relevance to living systematic reviews, we acknowledge that the absence of cross-domain results leaves the generalizability of the provenance and reconciliation mechanisms unverified. Expanding the benchmark to additional specialties and document types would require substantial new curation effort that lies outside the scope of the present study. In the revision we have therefore added an explicit Limitations section that discusses this restriction, reports additional component ablations performed on the existing oncology benchmark, and outlines concrete plans for future cross-domain evaluation. revision: partial
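Response 1 commits to statistical significance testing without naming a test. One common choice for paired system comparisons is a document-level paired bootstrap; the sketch below is this review's assumption, not the paper's procedure, and the accuracies are hypothetical:

```python
import random

def paired_bootstrap(sys_scores: list[float], base_scores: list[float],
                     n_resamples: int = 10_000, seed: int = 0) -> float:
    """Approximate one-sided p-value for 'system beats baseline' by
    resampling documents with replacement and counting resamples
    where the mean difference is non-positive."""
    assert len(sys_scores) == len(base_scores)
    rng = random.Random(seed)
    n = len(sys_scores)
    worse_or_equal = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(sys_scores[i] - base_scores[i] for i in idx) / n
        if delta <= 0:
            worse_or_equal += 1
    return worse_or_equal / n_resamples

# Hypothetical per-document accuracies, for illustration only.
print(paired_bootstrap([0.92, 0.88, 0.95, 0.90, 0.86, 0.93],
                       [0.80, 0.84, 0.78, 0.83, 0.79, 0.81]))
```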

Circularity Check

0 steps flagged

No circularity: empirical system evaluation only

full rationale

The paper presents a multi-agent extraction pipeline and evaluates it via direct accuracy comparison against parsed-text baselines on a clinician-curated oncology-trial benchmark. No equations, derivations, fitted parameters, or self-referential predictions appear anywhere in the manuscript. The central claim rests on external benchmark results rather than any reduction to the system's own inputs or prior self-citations. Generalizability concerns are orthogonal to circularity and do not trigger any of the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that LLM agents can reliably extract multimodal evidence when guided by layout preservation and page-level reconciliation; no free parameters, new physical entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption: LLM-based agents can achieve high-precision extraction from native PDFs when layout and figures are preserved and disagreements trigger page verification
    Invoked to support the pipeline's accuracy claims on oncology trials

pith-pipeline@v0.9.0 · 5510 in / 1089 out tokens · 37861 ms · 2026-05-15T00:03:10.615313+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 1 internal anchor

  1. [1]

    Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of EMNLP-IJCNLP

  2. [2]

    Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. 2020. TURL: Table understanding through representation learning. In Proceedings of VLDB

  3. [3]

    Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Martin Eisenschlos. 2020. TaPas: Weakly supervised table parsing via pre-training. In Proceedings of ACL

  4. [4]

    Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. 2022. OCR-free document understanding transformer. In Proceedings of ECCV

  5. [5]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of NeurIPS

  6. [6]

    Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. S2ORC: The Semantic Scholar open research corpus. In Proceedings of ACL

  7. [7]

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244

  8. [8]

    Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. 2021. DocVQA: A dataset for VQA on document images. In Proceedings of WACV

  9. [9]

    Xiangbin Meng, Xiangyu Yan, Kuo Zhang, and others. 2024. The application of large language models in medicine: A scoping review. iScience, 27:109713. https://doi.org/10.1016/j.isci.2024.109713

  10. [10]

    Nitesh Methani, Pritha Ganguly, Mitesh M. Khapra, and Pratyush Kumar. 2020. PlotQA: Reasoning over scientific plots. In Proceedings of WACV

  11. [11]

    Mahmud Omar, Girish N. Nadkarni, Eyal Klang, and Benjamin S. Glicksberg. 2024. Large language models in medicine: A review of current clinical trials across healthcare applications. PLOS Digital Health, 3(11):e0000662. https://doi.org/10.1371/journal.pdig.0000662

  12. [12]

    Wenxuan Wang and others. 2025. A survey of LLM-based agents in medicine: How far are we from Baymax? arXiv preprint arXiv:2502.11211

  13. [13]

    Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020. LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of KDD

  14. [14]

    Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for joint understanding of textual and tabular data. In Proceedings of ACL

  15. [15]

    Hongjian Zhou and others. 2023. A survey of large language models in medicine: Progress, application, and challenge. arXiv preprint arXiv:2311.05112
