pith. machine review for the scientific record.

arxiv: 2604.17667 · v1 · submitted 2026-04-19 · 💻 cs.CL · cs.IR

Recognition: unknown

Peerispect: Claim Verification in Scientific Peer Reviews

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:15 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords peer review · claim verification · natural language inference · information retrieval · fact checking · scientific publishing · evidence retrieval · interactive system

The pith

Peerispect extracts check-worthy claims from peer reviews and verifies them against the manuscript using retrieval and natural language inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an interactive system called Peerispect to operationalize claim-level verification in scientific peer reviews. It works by pulling out claims that need checking from the reviews, finding supporting or contradicting evidence in the submitted paper, and then using natural language inference to decide if the claims hold up. This approach addresses the challenge of scale in modern publishing where manual verification of every review statement is impractical. The system includes a visual interface that highlights the evidence right in the paper text for easy review. It is built as a modular pipeline that can swap in different retrieval and verification models and is made available through a demo and API for practical use by reviewers, authors, and committees.
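The page does not spell out which models sit behind each stage, so the following is only a minimal sketch of that three-stage flow, assuming a sentence-transformers bi-encoder for retrieval, an off-the-shelf MNLI classifier for verification, and naive sentence splitting for claim extraction; all three choices are illustrative stand-ins, not Peerispect's actual components.

```python
# Minimal sketch of the claim -> evidence -> verdict flow described above.
# All model names and the sentence-level "extraction" are assumptions for
# illustration, not the components Peerispect actually uses.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForSequenceClassification, AutoTokenizer

retriever = SentenceTransformer("all-MiniLM-L6-v2")            # assumed bi-encoder retriever
nli_tok = AutoTokenizer.from_pretrained("roberta-large-mnli")  # assumed off-the-shelf verifier
nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def extract_claims(review: str) -> list[str]:
    # Placeholder extraction: every sentence is a candidate claim; a real system
    # would keep only check-worthy, verifiable statements.
    return [s.strip() for s in review.split(".") if s.strip()]

def nli_verdict(premise: str, hypothesis: str) -> str:
    enc = nli_tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**enc).logits
    return nli_model.config.id2label[int(logits.argmax())]    # CONTRADICTION / NEUTRAL / ENTAILMENT

def verify_review(review: str, paper_passages: list[str], top_k: int = 3) -> list[dict]:
    corpus = retriever.encode(paper_passages, convert_to_tensor=True)
    report = []
    for claim in extract_claims(review):
        query = retriever.encode(claim, convert_to_tensor=True)
        hits = util.semantic_search(query, corpus, top_k=top_k)[0]
        evidence = [paper_passages[h["corpus_id"]] for h in hits]
        # Premise = retrieved passage, hypothesis = reviewer claim.
        report.append({"claim": claim,
                       "evidence": evidence,
                       "verdicts": [nli_verdict(p, claim) for p in evidence]})
    return report
```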

Core claim

Peerispect is presented as a modular information retrieval pipeline that extracts check-worthy claims from peer reviews, retrieves relevant evidence from the manuscript, and verifies the claims through natural language inference, with results displayed through a visual interface that highlights evidence directly in the paper.

What carries the argument

The modular IR pipeline consisting of claim extraction, evidence retrieval, and NLI-based verification, supported by an interactive visual interface.

Load-bearing premise

That current retrievers and natural language inference models can accurately process the specialized and often implicit language used in scientific peer review claims.
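To see why that premise is load-bearing, consider a hedged toy example; the sentences and the roberta-large-mnli checkpoint below are illustrative, not taken from the paper. An explicit paraphrase of a manuscript sentence is easy for off-the-shelf NLI, while an implicit, evaluative reviewer claim often is not.

```python
# Illustrative only: one explicit and one implicit reviewer claim checked against
# the same manuscript sentence. Model and text are assumptions, not from the paper.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def verdict(premise: str, hypothesis: str) -> str:
    enc = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    return model.config.id2label[int(logits.argmax())]

premise = "We evaluate on 150 manually annotated reviewer claims drawn from 25 papers."
print(verdict(premise, "The evaluation uses 150 annotated reviewer claims."))       # explicit paraphrase: likely ENTAILMENT
print(verdict(premise, "The evaluation is too small to support the conclusions."))  # evaluative, implicit: often NEUTRAL
```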

What would settle it

A set of peer reviews with manually annotated check-worthy claims, gold evidence locations, and gold verification labels. If the system consistently retrieved the wrong sections or returned incorrect verdicts on such a set, the core claim would not hold; if it tracked the annotations closely, it would.
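A hedged sketch of what such a test could look like as code: an assumed annotation layout (gold evidence sections plus a gold verdict per claim) scored with two obvious metrics. The field and metric names are inventions for illustration; the RRC set of 150 annotated claims from 25 papers described in the figure caption below is the kind of resource it would run over.

```python
# Sketch of an evaluation harness over manually annotated reviewer claims.
# The data layout and metric names are assumptions, not the paper's protocol.
from dataclasses import dataclass

@dataclass
class GoldClaim:
    claim: str
    gold_sections: set[str]   # section ids a human marked as correct evidence
    gold_verdict: str         # e.g. "supported" / "contradicted" / "not verifiable"

@dataclass
class SystemOutput:
    retrieved_sections: list[str]  # ranked section ids returned by the retriever
    verdict: str

def score(gold: list[GoldClaim], system: list[SystemOutput], k: int = 3) -> dict[str, float]:
    # Evidence hit@k: did any of the top-k retrieved sections match the annotation?
    hit_at_k = sum(bool(set(s.retrieved_sections[:k]) & g.gold_sections)
                   for g, s in zip(gold, system)) / len(gold)
    # Verdict accuracy: did the verification decision match the human label?
    verdict_acc = sum(g.gold_verdict == s.verdict for g, s in zip(gold, system)) / len(gold)
    return {"evidence_hit@k": hit_at_k, "verdict_accuracy": verdict_acc}
```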

Figures

Figures reproduced from arXiv: 2604.17667 by Ali Ghorbanpour, Alireza Daghighfarsoodeh, Ebrahim Bagheri, Negar Arabzadeh, Sajad Ebrahimi, Seyed Mohammad Hosseini, Soroush Sadeghian.

Figure 2
Figure 2: Screenshot of the Peerispect interface. The Real World Review Claims (RRC) set comprises 150 manually annotated reviewer claims from 25 papers. These complementary datasets allow us to rigorously assess evidence retrieval and verification accuracy, ensuring the demo reflects a tested, reliable system rather than a purely illustrative prototype.
read the original abstract

Peer review is central to scientific publishing, yet reviewers frequently include claims that are subjective, rhetorical, or misaligned with the submitted work. Assessing whether review statements are factual and verifiable is crucial for fairness and accountability. At the scale of modern conferences and journals, manually inspecting the grounding of such claims is infeasible. We present Peerispect, an interactive system that operationalizes claim-level verification in peer reviews by extracting check-worthy claims from peer reviews, retrieving relevant evidence from the manuscript, and verifying the claims through natural language inference. Results are presented through a visual interface that highlights evidence directly in the paper, enabling rapid inspection and interpretation. Peerispect is designed as a modular Information Retrieval (IR) pipeline, supporting alternative retrievers, rerankers, and verifiers, and is intended for use by reviewers, authors, and program committees. We demonstrate Peerispect through a live, publicly available demo (https://app.reviewer.ly/app/peerispect) and API services (https://github.com/Reviewerly-Inc/Peerispect), accompanied by a video tutorial (https://www.youtube.com/watch?v=pc9RkvkUh14).
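As an editorial illustration of the retrieve-then-rerank step the abstract alludes to ("alternative retrievers, rerankers, and verifiers"), here is a minimal sketch assuming a bi-encoder for recall and a cross-encoder for precision; the checkpoints are common public models, not the ones the demo ships with.

```python
# Hedged sketch: bi-encoder recall followed by cross-encoder reranking of manuscript
# passages for a single reviewer claim. Model names are assumptions for illustration.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(claim: str, passages: list[str],
                        recall_k: int = 20, final_k: int = 3) -> list[str]:
    corpus = bi_encoder.encode(passages, convert_to_tensor=True)
    query = bi_encoder.encode(claim, convert_to_tensor=True)
    candidates = [passages[h["corpus_id"]]
                  for h in util.semantic_search(query, corpus, top_k=recall_k)[0]]
    scores = reranker.predict([(claim, p) for p in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in ranked[:final_k]]
```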

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents Peerispect, an interactive modular IR system that extracts check-worthy claims from peer reviews, retrieves relevant evidence from the submitted manuscript, and verifies the claims via natural language inference. Results are displayed in a visual interface highlighting evidence in the paper. The work includes a public demo, API services, and a video tutorial, positioning the tool for use by reviewers, authors, and program committees.

Significance. If the pipeline functions reliably on peer-review text, the system could meaningfully support accountability and efficiency in scientific publishing by automating verification of factual grounding at scale. The modular design (allowing alternative retrievers, rerankers, and verifiers) and public release of the demo and API are clear strengths that facilitate adoption and extension.
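A minimal sketch of what that modularity could look like in code: structural interfaces that let any retriever, reranker, or verifier be swapped independently. The interface and method names are assumptions; the system's actual API may be organized differently.

```python
# Illustrative component interfaces for a swappable verification pipeline.
# Names are assumptions, not Peerispect's API.
from typing import Protocol

class Retriever(Protocol):
    def retrieve(self, claim: str, passages: list[str], k: int) -> list[str]: ...

class Reranker(Protocol):
    def rerank(self, claim: str, candidates: list[str]) -> list[str]: ...

class Verifier(Protocol):
    def verdict(self, evidence: str, claim: str) -> str: ...  # e.g. "entailment" / "contradiction" / "neutral"

def verify_claim(claim: str, passages: list[str],
                 retriever: Retriever, reranker: Reranker, verifier: Verifier,
                 k: int = 10) -> list[tuple[str, str]]:
    # Any objects implementing the three protocols can be plugged in here,
    # which is the property the significance paragraph credits to the system.
    evidence = reranker.rerank(claim, retriever.retrieve(claim, passages, k))
    return [(passage, verifier.verdict(passage, claim)) for passage in evidence]
```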

major comments (1)
  1. Abstract and system description: the central claim that Peerispect 'operationalizes claim-level verification' is presented without any quantitative results, error rates, baseline comparisons, human evaluation, or case studies on scientific peer-review language. This leaves the practical effectiveness of the extraction, retrieval, and NLI steps unassessed and makes it impossible to evaluate the weakest assumption that off-the-shelf or fine-tuned models suffice for implicit review claims.
minor comments (2)
  1. The manuscript would benefit from a dedicated section detailing the specific models or heuristics used for claim extraction and the prompting strategy for NLI, even if modular.
  2. Figure captions and interface screenshots should explicitly label which components (retriever, verifier) are active in each view to improve clarity for readers reproducing the demo.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need to better substantiate the system's effectiveness. We address the major comment below and outline planned revisions.

read point-by-point responses
  1. Referee: [—] Abstract and system description: the central claim that Peerispect 'operationalizes claim-level verification' is presented without any quantitative results, error rates, baseline comparisons, human evaluation, or case studies on scientific peer-review language. This leaves the practical effectiveness of the extraction, retrieval, and NLI steps unassessed and makes it impossible to evaluate the weakest assumption that off-the-shelf or fine-tuned models suffice for implicit review claims.

    Authors: We acknowledge that the manuscript presents no quantitative benchmarks, error rates, or human evaluations of the pipeline components on peer-review text. The paper's primary contribution is the design of a modular IR system, its public demo, API, and video tutorial, rather than an empirical study of model performance. We do not claim that off-the-shelf models suffice for implicit claims; the system is explicitly designed to allow substitution of retrievers, rerankers, and verifiers, enabling users to integrate stronger models. To address the concern, the revised manuscript will include a new section with qualitative case studies drawn from real peer reviews, explicit discussion of challenges with implicit and rhetorical claims, and a dedicated limitations section. These additions will provide concrete illustrations of the pipeline in action without overstating generalizability. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is a descriptive system presentation of Peerispect, an interactive IR pipeline for extracting check-worthy claims from reviews, retrieving manuscript evidence, and applying NLI verification, with a public demo and API. It contains no equations, no fitted parameters, no predictions of quantitative results, and no derivation chain. The central claim is simply that the modular system has been built and demonstrated; there are no self-citations, ansatzes, or uniqueness theorems that reduce the argument to its own inputs by construction. This is a standard non-circular engineering/systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems paper; no free parameters, mathematical axioms, or newly postulated entities are introduced.

pith-pipeline@v0.9.0 · 5535 in / 1043 out tokens · 35309 ms · 2026-05-10T05:15:50.213909+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references · 13 canonical work pages · 4 internal anchors

  1. [1]

    Negar Arabzadeh, Sajad Ebrahimi, Ali Ghorbanpour, Soroush Sadeghian, Sara Salamat, Muhan Li, Hai Son Le, Mahdi Bashari, and Ebrahim Bagheri. 2025. Building Trustworthy Peer Review Quality Assessment Systems. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management. 6863–6864. doi:10.1145/3746252.3761436

  2. [2]

    Negar Arabzadeh, Sajad Ebrahimi, Soroush Sadeghian, Seyed Mohammad Hosseini, Alireza Daqiq, Hai Son Le, Mahdi Bashari, and Ebrahim Bagheri. 2026. Can LLMs Uphold Research Integrity? Evaluating the Role of LLMs in Peer Review Quality. In Proceedings of the Nineteenth ACM International Conference on Web Search and Data Mining (WSDM ’26). 1341–1342. doi:10....

  3. [3]

    Negar Arabzadeh, Sajad Ebrahimi, Sara Salamat, Mahdi Bashari, and Ebrahim Bagheri. 2024. Reviewerly: Modeling the Reviewer Assignment Task as an Information Retrieval Problem. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management. 5554–5555. doi:10.1145/3627673.3679081

  4. [4]

    Ariful Azad and Afeefa Banu. 2024. Publication trends in artificial intelligence conferences: The rise of super prolific authors. arXiv preprint arXiv:2412.07793 (2024)

  5. [5]

    Kirsten Bell, Patricia Kingori, and David Mills. 2024. Scholarly publishing, boundary processes, and the problem of fake peer reviews. Science, Technology, & Human Values 49, 1 (2024), 78–104

  6. [6]

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. 2021. A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. (2021). arXiv:2105.03011 [cs.CL] https://arxiv.org/abs/2105.03011

  7. [7]

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. (2024). arXiv:2401.08281 [cs.LG]

  8. [8]

    John A Drozdz and Michael R Ladomery. 2024. The peer review process: past, present, and future. British Journal of Biomedical Science 81 (2024), 12054

  9. [9]

    Sajad Ebrahimi, Soroush Sadeghian, Ali Ghorbanpour, Negar Arabzadeh, Sara Salamat, Muhan Li, Hai Son Le, Mahdi Bashari, and Ebrahim Bagheri. 2025. RottenReviews: Benchmarking Review Quality with Human and LLM-Based Judgments. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM ’25). 5642–5649. doi:10.1145/3...

  10. [10]

    Sajad Ebrahimi, Sara Salamat, Negar Arabzadeh, Mahdi Bashari, and Ebrahim Bagheri. 2025. exHarmony: Authorship and Citations for Benchmarking the Reviewer Assignment Problem. In European Conference on Information Retrieval. Springer, 1–16. doi:10.1007/978-3-031-88714-7_1

  11. [11]

    Prashant Garg. 2020. Problems in peer review. Journal of Clinical and Diagnostic Research (2020)

  12. [12]

    Odest Chadwicke Jenkins and Matthew E. Taylor. 2025. AAAI-26 Review Process Update: Scale, Integrity Measures, and Experimental Use of AI-Assisted Reviewing

  13. [13]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  14. [14]

    Carole J Lee, Cassidy R Sugimoto, Guo Zhang, and Blaise Cronin. 2013. Bias in peer review. Journal of the American Society for Information Science and Technology 64, 1 (2013), 2–17

  15. [15]

    Seth S Leopold. 2015. Increased manuscript submissions prompt journals to make hard choices. Clinical Orthopaedics and Related Research® (2015)

  16. [16]

    Rodrigo Nogueira and Kyunghyun Cho. 2020. Passage Re-ranking with BERT. arXiv:1901.04085 [cs.IR] https://arxiv.org/abs/1901.04085

  17. [17]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. https://arxiv.org/abs/1908.10084

  18. [18]

    Abdelrahman Sadallah, Tim Baumgärtner, Iryna Gurevych, and Ted Briscoe. 2025. The good, the bad and the constructive: Automatically measuring peer review’s utility for authors. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 28979–29009

  19. [19]

    Alessandro Scirè, Karim Ghonim, and Roberto Navigli. 2024. FENICE: Factuality evaluation of summarization based on natural language inference and claim extraction. arXiv preprint arXiv:2403.02270 (2024)

  20. [20]

    Richard Smith. 2006. Peer review: a flawed process at the heart of science and journals. Journal of the Royal Society of Medicine 99, 4 (2006), 178–182

  21. [21]

    Jonathan P Tennant, Jonathan M Dugan, Daniel Graziotin, Damien C Jacques, François Waldner, Daniel Mietchen, Yehia Elkhatib, Lauren B Collister, Christina K Pikas, Tom Crick, et al. 2017. A multi-disciplinary perspective on emergent and future innovations in peer review. F1000Research (2017)

  22. [22]

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for Fact Extraction and VERification. (2018)

  24. [24]

    David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. SciFact: A Benchmark for Fact Checking in Scientific Writing. In Proceedings of EMNLP

  25. [25]

    Theodora Worledge, Tatsunori Hashimoto, and Carlos Guestrin. 2024. The extractive-abstractive spectrum: Uncovering verifiability trade-offs in LLM generations. arXiv preprint arXiv:2411.17375 (2024)

  26. [26]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)