LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges

Thi Huyen Nguyen; Zahra Ahmadi

arxiv: 2606.25057 · v1 · pith:PMDR7RJInew · submitted 2026-06-23 · 💻 cs.CL

LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges

Thi Huyen Nguyen , Zahra Ahmadi This is my paper

Pith reviewed 2026-06-25 23:37 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM-based peer reviewcritique generationscore predictionrobustness risksprompt injectionscientific evaluationAI-assisted reviewreliability challenges

0 comments

The pith

LLMs can generate fluent peer-review critiques and approximate scores, but their reliability, robustness, and security as decision-support tools remain insufficiently understood.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey maps how large language models are applied to two main peer-review tasks: producing written critiques and predicting numeric scores. It organizes methods into prompt-based, supervised, retrieval-augmented, and alignment-optimized categories, then compares them against existing benchmarks. The analysis flags dataset limits, domain biases, and evaluation gaps that make current results hard to generalize. It further identifies concrete manipulation threats such as prompt injection, data poisoning, and reward hacking that could undermine automated pipelines. A reader would care because peer review underpins scientific quality; if these gaps persist, scaling review with LLMs risks introducing new forms of error or bias rather than solving the volume problem.

Core claim

The paper claims that although LLMs produce fluent critiques and approximate reviewer scores in current studies, their reliability, robustness, and security as decision-support systems remain insufficiently understood. It delivers a taxonomy of modeling approaches and synthesizes findings across benchmarks while highlighting dataset constraints, domain concentration biases, and emerging robustness risks including prompt injection, data poisoning, retrieval vulnerabilities, and reward hacking. From a data-mining viewpoint it flags open challenges in modeling subjective disagreement and cross-domain generalization, reframing automated peer review as a high-stakes multi-objective decision probl

What carries the argument

Structured taxonomy of modeling approaches for critique generation and score prediction, covering prompt-based, supervised, retrieval-augmented, and alignment-optimized methods.

If this is right

Existing benchmarks suffer from dataset constraints and domain concentration biases that limit assessment of generalization.
Automated review pipelines are exposed to strategic manipulation via prompt injection, data poisoning, retrieval vulnerabilities, and reward hacking.
Key open challenges remain in modeling subjective disagreement among reviewers and achieving cross-domain generalization.
Reframing peer review as a high-stakes multi-objective decision problem is required to develop trustworthy AI-assisted evaluation systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Without explicit security measures, widespread adoption of LLM review could allow bad actors to influence which papers are accepted.
Hybrid systems that combine LLM drafts with human oversight and adversarial testing may be needed before deployment at scale.
Live experiments that insert controlled attacks into real review workflows could quantify the practical size of the identified risks.

Load-bearing premise

The body of existing studies and benchmarks provides a sufficiently representative sample to support a comprehensive taxonomy, identification of domain biases, and conclusions about robustness risks across the field.

What would settle it

A controlled study that measures LLM review outputs against human reviewers across multiple scientific domains and finds consistently high reliability with no successful prompt-injection or data-poisoning attacks would contradict the claim of insufficient understanding.

Figures

Figures reproduced from arXiv: 2606.25057 by Thi Huyen Nguyen, Zahra Ahmadi.

**Figure 2.** Figure 2: Peer review pipeline. LLM-based systems are increasingly being investigated for automating two fundamental components of peer review reports: (1) textual critique generation and (2) quantitative score prediction. These two components are central to editorial and program-committee decision-making. Critiques articulate structured assessments of a manuscript’s strengths and weaknesses, while scores translate… view at source ↗

**Figure 3.** Figure 3: Fraction of reviews detected as LLM-generated by [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Taxonomy of automated peer review generation, categorized by different aspects. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

read the original abstract

The rapid growth of scientific submissions has pushed traditional peer review toward its scalability limits, motivating the exploration of large language models (LLMs) as intelligent automated evaluation assistants. Although recent studies show that LLMs can generate fluent critiques and approximate reviewer scores, their reliability, robustness, and security as decision-support systems remain insufficiently understood. This survey offers a systems-level analysis of LLM-based scientific peer review, focusing on two core evaluative functions: critique generation and score prediction. We present a structured taxonomy of modeling approaches (including prompt-based, supervised, retrieval-augmented, and alignment-optimized approaches), and synthesize empirical findings across existing benchmarks. We analyze dataset constraints, evaluation shortcomings, and domain concentration biases that limit current assessment practices. Beyond performance metrics, we identify emerging robustness risks, including prompt injection, data poisoning, retrieval vulnerabilities, and reward hacking, which expose automated review pipelines to strategic manipulation. From a data mining perspective, we outline key open challenges in modeling subjective disagreement and cross-domain generalization. By reframing automated peer review as a high-stakes, multi-objective decision problem, this survey provides a roadmap for developing robust, transparent, and trustworthy AI-assisted scientific evaluation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This survey taxonomizes LLM approaches to critique and scoring in peer review and flags manipulation risks, but provides no details on how the underlying studies were chosen.

read the letter

This survey organizes existing work on using LLMs for scientific peer review into a taxonomy of prompt-based, supervised, retrieval-augmented, and alignment-optimized methods. It synthesizes findings on critique generation and score prediction, then shifts focus to dataset constraints, domain concentration biases, and risks such as prompt injection, data poisoning, retrieval vulnerabilities, and reward hacking.

The paper does a reasonable job framing automated review as a high-stakes multi-objective problem and pulling out the data-mining angle on subjective disagreement and cross-domain generalization. That moves the discussion past raw performance numbers toward practical deployment concerns.

The clear limitation is the absence of any search strategy, inclusion criteria, or quantitative synthesis method. Without those, the taxonomy and the claims about robustness risks and domain biases rest on an unverified sample of the literature. If the cited studies skew toward CS or NLP venues, the identified failure modes may not generalize as broadly as presented.

Researchers building or evaluating AI tools for peer review will find the roadmap and risk catalog useful as an entry point. It is a synthesis rather than new experiments, so its value depends on how well the selection of prior work holds up.

I would send this to peer review. The topic is timely and the risk analysis could steer better system design, even if the survey methods need explicit documentation in revision.

Referee Report

2 major / 2 minor

Summary. The paper is a survey on LLM-based scientific peer review, focusing on critique generation and score prediction. It presents a taxonomy of approaches (prompt-based, supervised, retrieval-augmented, alignment-optimized), synthesizes empirical findings from benchmarks, analyzes dataset constraints, evaluation shortcomings, and domain concentration biases, identifies robustness risks (prompt injection, data poisoning, retrieval vulnerabilities, reward hacking), discusses challenges in modeling subjective disagreement and cross-domain generalization from a data-mining perspective, and provides a roadmap for robust AI-assisted evaluation systems.

Significance. If the underlying literature synthesis holds, the survey makes a timely contribution by framing automated peer review as a high-stakes multi-objective problem and cataloging specific risks and limitations that current benchmarks fail to address. It gives credit to the structured taxonomy and the explicit call-out of manipulation vectors (e.g., prompt injection) that expose decision-support pipelines. The work could usefully guide future benchmark design, though its influence depends on the representativeness of the reviewed studies.

major comments (2)

[Abstract / survey methodology] Abstract and survey methodology section: no search strategy, inclusion criteria, database sources, or quantitative synthesis protocol (e.g., PRISMA-style) are described. This is load-bearing for the central claims that the taxonomy captures 'domain concentration biases' and that robustness risks are 'insufficiently understood' across the field; without these details the representativeness of the selected benchmarks cannot be assessed and the synthesis risks over-generalizing from CS/NLP-centric studies.
[Empirical findings synthesis] Section synthesizing empirical findings: the claims about cross-domain generalization failures and pipeline-specific manipulation risks rest on the same unverified sampling assumption. If the reviewed benchmarks are skewed toward narrow review formats or particular venues, the identified 'domain concentration biases' and the risk catalog may not generalize, weakening the roadmap for trustworthy systems.

minor comments (2)

[Abstract] The abstract states that LLMs 'approximate reviewer scores' but does not clarify whether this refers to correlation with human scores, ranking accuracy, or other metrics; a brief definition would improve precision.
[Robustness risks] Several risk categories (prompt injection, data poisoning) are listed without citing the specific peer-review benchmarks or studies that demonstrate them in this domain; adding those references would strengthen the synthesis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on methodological transparency and the generalizability of our synthesis. We address each major comment below and will make targeted revisions to improve clarity without altering the core contributions.

read point-by-point responses

Referee: [Abstract / survey methodology] Abstract and survey methodology section: no search strategy, inclusion criteria, database sources, or quantitative synthesis protocol (e.g., PRISMA-style) are described. This is load-bearing for the central claims that the taxonomy captures 'domain concentration biases' and that robustness risks are 'insufficiently understood' across the field; without these details the representativeness of the selected benchmarks cannot be assessed and the synthesis risks over-generalizing from CS/NLP-centric studies.

Authors: We agree this detail is needed for assessing representativeness. In revision we will insert a dedicated 'Survey Methodology' subsection describing the literature search process: databases queried (arXiv, ACL Anthology, Semantic Scholar, Google Scholar), search keywords and date range (post-2022 papers on LLM peer review), inclusion criteria (focus on critique generation or score prediction with empirical evaluation), and exclusion criteria. We will also note that this is a narrative survey of an emerging area rather than a PRISMA-compliant systematic review, and will add an explicit limitations paragraph on potential coverage gaps. These additions will directly support the domain-bias claims by making the sampling basis transparent. revision: yes
Referee: [Empirical findings synthesis] Section synthesizing empirical findings: the claims about cross-domain generalization failures and pipeline-specific manipulation risks rest on the same unverified sampling assumption. If the reviewed benchmarks are skewed toward narrow review formats or particular venues, the identified 'domain concentration biases' and the risk catalog may not generalize, weakening the roadmap for trustworthy systems.

Authors: We accept the point that claims must be qualified by the underlying sample. We will revise the empirical synthesis and roadmap sections to (1) tabulate the venues and domains of the reviewed benchmarks, (2) explicitly state that observed generalization failures and manipulation vectors are drawn primarily from CS/NLP studies, and (3) add a forward-looking paragraph recommending construction of cross-domain benchmarks. This keeps the risk catalog intact while preventing over-generalization. revision: yes

Circularity Check

0 steps flagged

No circularity: survey synthesizes external benchmarks without self-referential reductions

full rationale

This is a survey paper presenting a taxonomy and synthesis of existing LLM peer-review studies and benchmarks. No derivations, equations, fitted parameters, or predictions appear that reduce by construction to the authors' own inputs. Claims rest on analysis of external literature; the representativeness concern raised in the skeptic note is a sampling/correctness issue, not a circularity reduction. No self-citation chains or ansatzes are load-bearing in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The survey rests on the domain assumption that existing empirical studies on LLM critique generation and score prediction are adequate to identify general patterns, biases, and risks; no free parameters or invented entities are introduced.

axioms (2)

domain assumption The rapid growth of scientific submissions has pushed traditional peer review toward its scalability limits
Opening motivation stated in the abstract.
domain assumption LLMs can be evaluated on two core functions: critique generation and score prediction
Central framing of the survey scope in the abstract.

pith-pipeline@v0.9.1-grok · 5735 in / 1272 out tokens · 28658 ms · 2026-06-25T23:37:54.442576+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

78 extracted references · 27 canonical work pages · 6 internal anchors

[1]

Aaai launches ai-powered peer review assessment system

AAAI. Aaai launches ai-powered peer review assessment system. https://aaai.org/ aaai-launches-ai-powered-peer-review-assessment-system/, 2025. Published: 2025-05-16

2025
[2]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Banerjee and A

S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005

2005
[4]

P. K. Bharti, V. Dalal, and M. Panchal. Co-reviewer: can ai review like a human? an agentic framework for llm-human alignment in peer review.Scientometrics, pages 1–42, 2026

2026
[5]

P. K. Bharti, S. Ranjan, T. Ghosal, M. Agrawal, and A. Ekbal. Peerassist: leveraging on paper-review interactions to predict peer review decisions. InTowards Open and Trustworthy Digital Societies: 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings 23, pages 421–435. Springer, 2021

2021
[6]

Biggio and F

B. Biggio and F. Roli. Wild patterns: Ten years after the rise of adversarial machine learning. InProceedings of the 2018 ACM SIGSAC conference on computer and communications security, pages 2154–2156, 2018

2018
[7]

Biswas, D

S. Biswas, D. Dobaria, and H. L. Cohen. Chatgpt and the future of journal reviews: a feasibility study.The Yale Journal of Biology and Medicine, 96(3):415, 2023

2023
[8]

Bornmann

L. Bornmann. Scientific peer review.Annual review of information science and technology, 45(1):197–245, 2011

2011
[9]

Chai and R

T. Chai and R. R. Draxler. Root mean square error (rmse) or mean absolute error (mae)?–arguments against avoiding rmse in the literature.Geoscientific model development, 7(3):1247–1250, 2014

2014
[10]

S. Chen, D. Brumby, and A. Cox. Envisioning the future of peer review: Inves- tigating llm-assisted reviewing using chatgpt as a case study. InProceedings of the 4th Annual Symposium on Human-Computer Interaction for Work, pages 1–18, 2025

2025
[11]

D’Arcy, T

M. D’Arcy, T. Hope, L. Birnbaum, and D. Downey. Marg: Multi-agent review generation for scientific papers.arXiv preprint arXiv:2401.04259, 2024

work page arXiv 2024
[12]

J. Du, Y. Wang, W. Zhao, Z. Deng, S. Liu, R. Lou, H. P. Zou, P. N. Venkit, N. Zhang, M. Srinath, et al. Llms assist nlp researchers: Critique paper (meta-) reviewing. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

2024
[13]

Dycke, I

N. Dycke, I. Kuznetsov, and I. Gurevych. Nlpeer: A unified resource for the computational study of peer review. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

2023
[14]

A. R. B. M. Faizullah, A. Urlana, and R. Mishra. Limgen: Probing the llms for generating suggestive limitations of research papers. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 106–124. Springer, 2024

2024
[15]

X. Gao, J. Ruan, J. Gao, T. Liu, and Y. Fu. Reviewagents: Bridging the gap between human and ai-generated paper reviews.arXiv preprint arXiv:2503.08506, 2025

work page arXiv 2025
[16]

Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, H. Wang, H. Wang, et al. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2(1):32, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Z. Gao, K. Brantley, and T. Joachims. Reviewer2: Optimizing review generation through prompt generation.arXiv preprint arXiv:2402.10886, 2024

work page arXiv 2024
[18]

Hosseini and S

M. Hosseini and S. P. Horbach. Fighting reviewer fatigue or amplifying bias? considerations and recommendations for use of chatgpt and other large language models in scholarly peer review.Research integrity and peer review, 8(1):4, 2023

2023
[19]

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

2022
[20]

X. Hua, M. Nikolov, N. Badugu, and L. Wang. Argument mining for understanding peer reviews. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics), 2019

2019
[21]

Huang, W

L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

2025
[22]

Leveraging llm feedback to enhance review quality

ICLR. Leveraging llm feedback to enhance review quality. https://blog.iclr.cc/ 2025/04/15/leveraging-llm-feedback-to-enhance-review-quality/, 2025. Pub- lished: 2025-04-15

2025
[23]

Idahl and Z

M. Idahl and Z. Ahmadi. Openreviewer: A specialized large language model for generating critical scientific paper reviews. InProceedings of the 2025 Confer- ence of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025

2025
[24]

Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation.ACM computing surveys, 55(12):1–38, 2023

2023
[25]

Y. Jin, Q. Zhao, Y. Wang, H. Chen, K. Zhu, Y. Xiao, and J. Wang. Agentreview: Exploring peer review dynamics with llm agents. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

2024
[26]

D. Kang, W. Ammar, B. Dalvi, M. Van Zuylen, S. Kohlmeier, E. Hovy, and R. Schwartz. A dataset of peer reviews (peerread): Collection, insights and nlp applications. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018

2018
[27]

J. Keuper. Prompt injection attacks on llm generated reviews of scientific publi- cations.arXiv preprint arXiv:2509.10248, 2025

work page arXiv 2025
[28]

Kojima, S

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

2022
[29]

Künzli, A

N. Künzli, A. Berger, K. Czabanowska, R. Lucas, A. Madarasova Geckova, S. Mantwill, and O. von Dem Knesebeck. «i do not have time»—is this the end of peer review in public health sciences?Public health reviews, 43:1605407, 2022

2022
[30]

Kuznetsov, O

I. Kuznetsov, O. M. Afzal, K. Dercksen, N. Dycke, A. Goldberg, T. Hope, D. Hovy, J. K. Kummerfeld, A. Lauscher, K. Leyton-Brown, et al. What can natural language processing do for peer review?arXiv preprint arXiv:2405.06563, 2024

work page arXiv 2024
[31]

G. R. Latona, M. H. Ribeiro, T. R. Davidson, V. Veselovsky, and R. West. The ai review lottery: Widespread ai-assisted peer reviews boost paper scores and acceptance rates.arXiv preprint arXiv:2405.02150, 2024

work page arXiv 2024
[32]

Lewis, E

P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

2020
[33]

J. Li, W. X. Zhao, J.-R. Wen, and Y. Song. Generating long and informative reviews with aspect-aware coarse-to-fine decoding. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1969–1979, 2019

1969
[34]

Liang, Y

W. Liang, Y. Zhang, H. Cao, B. Wang, D. Y. Ding, X. Yang, K. Vodrahalli, S. He, D. S. Smith, Y. Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis.NEJM AI, 1(8):AIoa2400196, 2024

2024
[35]

C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

2004
[36]

J. Lin, J. Song, Z. Zhou, Y. Chen, and X. Shi. Moprd: A multidisciplinary open peer review dataset.Neural Computing and Applications, 35(34):24191–24206, 2023

2023
[37]

P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.ACM computing surveys, 55(9):1–35, 2023

2023
[38]

Z. Luo, Z. Yang, Z. Xu, W. Yang, and X. Du. Llm4sr: A survey on large language models for scientific research.arXiv preprint arXiv:2501.04306, 2025

work page arXiv 2025
[39]

Markhasin

E. Markhasin. Ai-driven scholarly peer review via persistent workflow prompting, meta-prompting, and meta-reasoning.arXiv preprint arXiv:2505.03332, 2025

work page arXiv 2025
[40]

S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettle- moyer. Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

2022
[41]

Are We Truly Innovating? A Qualitative and Quantitative Study of Originality in AI Research Papers

A. Mostafa, T. H. Nguyen, and Z. Ahmadi. What is novel? a knowledge- driven framework for bias-aware literature originality evaluation.arXiv preprint arXiv:2602.06054, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[42]

Mulligan, L

A. Mulligan, L. Hall, and E. Raphael. Peer review in a changing world: An inter- national study measuring the attitudes of researchers.Journal of the American Society for Information Science and Technology, 64(1):132–161, 2013

2013
[43]

Chatgpt (mar 14 version)

OpenAI. Chatgpt (mar 14 version). https://chat.openai.com, 2023. Accessed: 2025-04-29

2023
[44]

Issues of ai and academic transparency

OPUS Project Consortium. Issues of ai and academic transparency. https://opusproject.eu/openscience-news/issues-of-ai-and-academic- transparency/?utm_source=chatgpt.com, 2024. published: 2024-05-03

2024
[45]

Paper copilot statistics

Paper Copilot. Paper copilot statistics. https://papercopilot.com/statistics/, 2026

2026
[46]

Papineni, S

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

2002
[47]

Pataranutaporn, N

P. Pataranutaporn, N. Powdthavee, and P. Maes. Can ai solve the peer review crisis? a large scale experiment on llm’s performance and biases in evaluating ACM SIGKDD Explorations Newsletter, June 2026, Woodstock, NY Nguyen and Ahmadi economics papers.arXiv preprint arXiv:2502.00070, 2025

work page arXiv 2026
[48]

E. L. Pier, M. Brauer, A. Filut, A. Kaatz, J. Raclaw, M. J. Nathan, C. E. Ford, and M. Carnes. Low agreement among reviewers evaluating the same nih grant applications.Proceedings of the National Academy of Sciences, 115(12):2952–2957, 2018

2018
[49]

Price and P

S. Price and P. A. Flach. Computational support for academic peer review: A perspective from artificial intelligence.Communications of the ACM, 60(3):70–79, 2017

2017
[50]

Robertson

Z. Robertson. Gpt4 is slightly helpful for peer-review assistance: A pilot study. arXiv preprint arXiv:2307.05492, 2023

work page arXiv 2023
[51]

A. Saad, N. Jenko, S. Ariyaratne, N. Birch, K. P. Iyengar, A. M. Davies, R. Vaishya, and R. Botchu. Exploring the potential of chatgpt in the peer review process: an observational study.Diabetes & Metabolic Syndrome: Clinical Research & Reviews, 18(2):102946, 2024

2024
[52]

Stahl, L

M. Stahl, L. Biermann, A. Nehring, and H. Wachsmuth. Exploring llm prompting strategies for joint essay scoring and feedback generation. InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), 2024

2024
[53]

Stappen, G

L. Stappen, G. Rizos, M. Hasan, T. Hain, and B. W. Schuller. Uncertainty-aware machine support for paper reviewing on the interspeech 2019 submission corpus. In21st Annual Conference of the International Speech Communication Association, 2020

2019
[54]

Sukpanichnant, A

P. Sukpanichnant, A. Rapberger, and F. Toni. Peerarg: Argumentative peer review with llms.arXiv preprint arXiv:2409.16813, 2024

work page arXiv 2024
[55]

Taechoyotin and D

P. Taechoyotin and D. Acuna. Remor: Automated peer review generation with llm reasoning and multi-objective reinforcement learning.arXiv preprint arXiv:2505.11718, 2025

work page arXiv 2025
[56]

C. Tan, D. Lyu, S. Li, Z. Gao, J. Wei, S. Ma, Z. Liu, and S. Z. Li. Peer review as a multi-turn and long-context dialogue with role-based interactions.arXiv preprint arXiv:2406.05688, 2024

work page arXiv 2024
[57]

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[58]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[59]

Tyser, B

K. Tyser, B. Segev, G. Longhitano, X.-Y. Zhang, Z. Meeks, J. Lee, U. Garg, N. Belsten, A. Shporer, M. Udell, et al. Ai-driven review systems: evaluating llms in scalable and bias-aware academic reviews.arXiv preprint arXiv:2408.10365, 2024

work page arXiv 2024
[60]

Wang and Y

Q. Wang and Y. Tan. Grammatical error detection with self attention by pairwise training. In2020 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2020

2020
[61]

Q. Wang, Q. Zeng, L. Huang, K. Knight, H. Ji, and N. F. Rajani. ReviewRobot: Explainable paper review generation based on knowledge synthesis. InProceed- ings of the 13th International Conference on Natural Language Generation, pages 384–397, 2020

2020
[62]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022
[63]

Y. Weng, M. Zhu, G. Bao, H. Zhang, J. Wang, Y. Zhang, and L. Yang. Cyclere- searcher: Improving automated research via automated review. 2025

2025
[64]

C. J. Willmott and K. Matsuura. Advantages of the mean absolute error (mae) over the root mean square error (rmse) in assessing average model performance. Climate research, 30(1):79–82, 2005

2005
[65]

R. Ye, X. Pang, J. Chai, J. Chen, Z. Yin, Z. Xiang, X. Dong, J. Shao, and S. Chen. Are we there yet? revealing the risks of utilizing large language models in scholarly peer review.arXiv preprint arXiv:2412.01708, 2024

work page arXiv 2024
[66]

J. Yu, Z. Ding, J. Tan, K. Luo, Z. Weng, C. Gong, L. Zeng, R. Cui, C. Han, Q. Sun, et al. Automated peer reviewing in paper sea: Standardization, evaluation, and analysis.arXiv preprint arXiv:2407.12857, 2024

work page arXiv 2024
[67]

J. Yu, Z. Ding, J. Tan, K. Luo, Z. Weng, C. Gong, L. Zeng, R. Cui, C. Han, Q. Sun, Z. Wu, Y. Lan, and X. Li. Automated peer reviewing in paper SEA: Standardiza- tion, evaluation, and analysis. InFindings of the Association for Computational Linguistics: EMNLP 2024, 2024

2024
[68]

S. Yu, M. Luo, A. Madasu, V. Lal, and P. Howard. Is your paper being reviewed by an llm? investigating ai text detectability in peer review.arXiv preprint arXiv:2410.03019, 2024

work page arXiv 2024
[69]

Yuan and P

W. Yuan and P. Liu. Kid-review: knowledge-guided scientific review generation with oracle pre-training. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11639–11647, 2022

2022
[70]

W. Yuan, P. Liu, and G. Neubig. Can we automate scientific reviewing?Journal of Artificial Intelligence Research, 75:171–212, 2022

2022
[71]

Zhang, Z

D. Zhang, Z. Bao, S. Du, Z. Zhao, K. Zhang, D. Bao, and Y. Yang. Re2: A consistency- ensured dataset for full-stage peer review and multi-turn rebuttal discussions. arXiv preprint arXiv:2505.07920, 2025

work page arXiv 2025
[72]

BERTScore: Evaluating Text Generation with BERT

T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. Bertscore: Evaluating text generation with bert.arXiv preprint arXiv:1904.09675, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[73]

W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger. Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 563–578, 2019

2019
[74]

Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh. Calibrate before use: Improving few-shot performance of language models. InInternational conference on machine learning, pages 12697–12706. PMLR, 2021

2021
[75]

R. Zhou, L. Chen, and K. Yu. Is llm a reliable reviewer? a comprehensive evaluation of llm on automatic paper reviewing tasks. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9340–9351, 2024

2024
[76]

C. Zhu, J. Xiong, R. Ma, Z. Lu, Y. Liu, and L. Li. When your reviewer is an llm: Biases, divergence, and prompt injection risks in peer review.arXiv preprint arXiv:2509.09912, 2025

work page arXiv 2025
[77]

M. Zhu, Y. Weng, L. Yang, and Y. Zhang. Deepreview: Improving llm-based paper review with human-like deep thinking process.arXiv preprint arXiv:2503.08569, 2025

work page arXiv 2025
[78]

Zhuang, J

Z. Zhuang, J. Chen, H. Xu, Y. Jiang, and J. Lin. Large language models for automated scholarly paper review: A survey.arXiv preprint arXiv:2501.10326, 2025. Received 23 June 2026

work page arXiv 2025

[1] [1]

Aaai launches ai-powered peer review assessment system

AAAI. Aaai launches ai-powered peer review assessment system. https://aaai.org/ aaai-launches-ai-powered-peer-review-assessment-system/, 2025. Published: 2025-05-16

2025

[2] [2]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Banerjee and A

S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005

2005

[4] [4]

P. K. Bharti, V. Dalal, and M. Panchal. Co-reviewer: can ai review like a human? an agentic framework for llm-human alignment in peer review.Scientometrics, pages 1–42, 2026

2026

[5] [5]

P. K. Bharti, S. Ranjan, T. Ghosal, M. Agrawal, and A. Ekbal. Peerassist: leveraging on paper-review interactions to predict peer review decisions. InTowards Open and Trustworthy Digital Societies: 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings 23, pages 421–435. Springer, 2021

2021

[6] [6]

Biggio and F

B. Biggio and F. Roli. Wild patterns: Ten years after the rise of adversarial machine learning. InProceedings of the 2018 ACM SIGSAC conference on computer and communications security, pages 2154–2156, 2018

2018

[7] [7]

Biswas, D

S. Biswas, D. Dobaria, and H. L. Cohen. Chatgpt and the future of journal reviews: a feasibility study.The Yale Journal of Biology and Medicine, 96(3):415, 2023

2023

[8] [8]

Bornmann

L. Bornmann. Scientific peer review.Annual review of information science and technology, 45(1):197–245, 2011

2011

[9] [9]

Chai and R

T. Chai and R. R. Draxler. Root mean square error (rmse) or mean absolute error (mae)?–arguments against avoiding rmse in the literature.Geoscientific model development, 7(3):1247–1250, 2014

2014

[10] [10]

S. Chen, D. Brumby, and A. Cox. Envisioning the future of peer review: Inves- tigating llm-assisted reviewing using chatgpt as a case study. InProceedings of the 4th Annual Symposium on Human-Computer Interaction for Work, pages 1–18, 2025

2025

[11] [11]

D’Arcy, T

M. D’Arcy, T. Hope, L. Birnbaum, and D. Downey. Marg: Multi-agent review generation for scientific papers.arXiv preprint arXiv:2401.04259, 2024

work page arXiv 2024

[12] [12]

J. Du, Y. Wang, W. Zhao, Z. Deng, S. Liu, R. Lou, H. P. Zou, P. N. Venkit, N. Zhang, M. Srinath, et al. Llms assist nlp researchers: Critique paper (meta-) reviewing. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

2024

[13] [13]

Dycke, I

N. Dycke, I. Kuznetsov, and I. Gurevych. Nlpeer: A unified resource for the computational study of peer review. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

2023

[14] [14]

A. R. B. M. Faizullah, A. Urlana, and R. Mishra. Limgen: Probing the llms for generating suggestive limitations of research papers. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 106–124. Springer, 2024

2024

[15] [15]

X. Gao, J. Ruan, J. Gao, T. Liu, and Y. Fu. Reviewagents: Bridging the gap between human and ai-generated paper reviews.arXiv preprint arXiv:2503.08506, 2025

work page arXiv 2025

[16] [16]

Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, H. Wang, H. Wang, et al. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2(1):32, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

Z. Gao, K. Brantley, and T. Joachims. Reviewer2: Optimizing review generation through prompt generation.arXiv preprint arXiv:2402.10886, 2024

work page arXiv 2024

[18] [18]

Hosseini and S

M. Hosseini and S. P. Horbach. Fighting reviewer fatigue or amplifying bias? considerations and recommendations for use of chatgpt and other large language models in scholarly peer review.Research integrity and peer review, 8(1):4, 2023

2023

[19] [19]

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

2022

[20] [20]

X. Hua, M. Nikolov, N. Badugu, and L. Wang. Argument mining for understanding peer reviews. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics), 2019

2019

[21] [21]

Huang, W

L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

2025

[22] [22]

Leveraging llm feedback to enhance review quality

ICLR. Leveraging llm feedback to enhance review quality. https://blog.iclr.cc/ 2025/04/15/leveraging-llm-feedback-to-enhance-review-quality/, 2025. Pub- lished: 2025-04-15

2025

[23] [23]

Idahl and Z

M. Idahl and Z. Ahmadi. Openreviewer: A specialized large language model for generating critical scientific paper reviews. InProceedings of the 2025 Confer- ence of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025

2025

[24] [24]

Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation.ACM computing surveys, 55(12):1–38, 2023

2023

[25] [25]

Y. Jin, Q. Zhao, Y. Wang, H. Chen, K. Zhu, Y. Xiao, and J. Wang. Agentreview: Exploring peer review dynamics with llm agents. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

2024

[26] [26]

D. Kang, W. Ammar, B. Dalvi, M. Van Zuylen, S. Kohlmeier, E. Hovy, and R. Schwartz. A dataset of peer reviews (peerread): Collection, insights and nlp applications. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018

2018

[27] [27]

J. Keuper. Prompt injection attacks on llm generated reviews of scientific publi- cations.arXiv preprint arXiv:2509.10248, 2025

work page arXiv 2025

[28] [28]

Kojima, S

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

2022

[29] [29]

Künzli, A

N. Künzli, A. Berger, K. Czabanowska, R. Lucas, A. Madarasova Geckova, S. Mantwill, and O. von Dem Knesebeck. «i do not have time»—is this the end of peer review in public health sciences?Public health reviews, 43:1605407, 2022

2022

[30] [30]

Kuznetsov, O

I. Kuznetsov, O. M. Afzal, K. Dercksen, N. Dycke, A. Goldberg, T. Hope, D. Hovy, J. K. Kummerfeld, A. Lauscher, K. Leyton-Brown, et al. What can natural language processing do for peer review?arXiv preprint arXiv:2405.06563, 2024

work page arXiv 2024

[31] [31]

G. R. Latona, M. H. Ribeiro, T. R. Davidson, V. Veselovsky, and R. West. The ai review lottery: Widespread ai-assisted peer reviews boost paper scores and acceptance rates.arXiv preprint arXiv:2405.02150, 2024

work page arXiv 2024

[32] [32]

Lewis, E

P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

2020

[33] [33]

J. Li, W. X. Zhao, J.-R. Wen, and Y. Song. Generating long and informative reviews with aspect-aware coarse-to-fine decoding. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1969–1979, 2019

1969

[34] [34]

Liang, Y

W. Liang, Y. Zhang, H. Cao, B. Wang, D. Y. Ding, X. Yang, K. Vodrahalli, S. He, D. S. Smith, Y. Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis.NEJM AI, 1(8):AIoa2400196, 2024

2024

[35] [35]

C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

2004

[36] [36]

J. Lin, J. Song, Z. Zhou, Y. Chen, and X. Shi. Moprd: A multidisciplinary open peer review dataset.Neural Computing and Applications, 35(34):24191–24206, 2023

2023

[37] [37]

P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.ACM computing surveys, 55(9):1–35, 2023

2023

[38] [38]

Z. Luo, Z. Yang, Z. Xu, W. Yang, and X. Du. Llm4sr: A survey on large language models for scientific research.arXiv preprint arXiv:2501.04306, 2025

work page arXiv 2025

[39] [39]

Markhasin

E. Markhasin. Ai-driven scholarly peer review via persistent workflow prompting, meta-prompting, and meta-reasoning.arXiv preprint arXiv:2505.03332, 2025

work page arXiv 2025

[40] [40]

S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettle- moyer. Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

2022

[41] [41]

Are We Truly Innovating? A Qualitative and Quantitative Study of Originality in AI Research Papers

A. Mostafa, T. H. Nguyen, and Z. Ahmadi. What is novel? a knowledge- driven framework for bias-aware literature originality evaluation.arXiv preprint arXiv:2602.06054, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[42] [42]

Mulligan, L

A. Mulligan, L. Hall, and E. Raphael. Peer review in a changing world: An inter- national study measuring the attitudes of researchers.Journal of the American Society for Information Science and Technology, 64(1):132–161, 2013

2013

[43] [43]

Chatgpt (mar 14 version)

OpenAI. Chatgpt (mar 14 version). https://chat.openai.com, 2023. Accessed: 2025-04-29

2023

[44] [44]

Issues of ai and academic transparency

OPUS Project Consortium. Issues of ai and academic transparency. https://opusproject.eu/openscience-news/issues-of-ai-and-academic- transparency/?utm_source=chatgpt.com, 2024. published: 2024-05-03

2024

[45] [45]

Paper copilot statistics

Paper Copilot. Paper copilot statistics. https://papercopilot.com/statistics/, 2026

2026

[46] [46]

Papineni, S

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

2002

[47] [47]

Pataranutaporn, N

P. Pataranutaporn, N. Powdthavee, and P. Maes. Can ai solve the peer review crisis? a large scale experiment on llm’s performance and biases in evaluating ACM SIGKDD Explorations Newsletter, June 2026, Woodstock, NY Nguyen and Ahmadi economics papers.arXiv preprint arXiv:2502.00070, 2025

work page arXiv 2026

[48] [48]

E. L. Pier, M. Brauer, A. Filut, A. Kaatz, J. Raclaw, M. J. Nathan, C. E. Ford, and M. Carnes. Low agreement among reviewers evaluating the same nih grant applications.Proceedings of the National Academy of Sciences, 115(12):2952–2957, 2018

2018

[49] [49]

Price and P

S. Price and P. A. Flach. Computational support for academic peer review: A perspective from artificial intelligence.Communications of the ACM, 60(3):70–79, 2017

2017

[50] [50]

Robertson

Z. Robertson. Gpt4 is slightly helpful for peer-review assistance: A pilot study. arXiv preprint arXiv:2307.05492, 2023

work page arXiv 2023

[51] [51]

A. Saad, N. Jenko, S. Ariyaratne, N. Birch, K. P. Iyengar, A. M. Davies, R. Vaishya, and R. Botchu. Exploring the potential of chatgpt in the peer review process: an observational study.Diabetes & Metabolic Syndrome: Clinical Research & Reviews, 18(2):102946, 2024

2024

[52] [52]

Stahl, L

M. Stahl, L. Biermann, A. Nehring, and H. Wachsmuth. Exploring llm prompting strategies for joint essay scoring and feedback generation. InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), 2024

2024

[53] [53]

Stappen, G

L. Stappen, G. Rizos, M. Hasan, T. Hain, and B. W. Schuller. Uncertainty-aware machine support for paper reviewing on the interspeech 2019 submission corpus. In21st Annual Conference of the International Speech Communication Association, 2020

2019

[54] [54]

Sukpanichnant, A

P. Sukpanichnant, A. Rapberger, and F. Toni. Peerarg: Argumentative peer review with llms.arXiv preprint arXiv:2409.16813, 2024

work page arXiv 2024

[55] [55]

Taechoyotin and D

P. Taechoyotin and D. Acuna. Remor: Automated peer review generation with llm reasoning and multi-objective reinforcement learning.arXiv preprint arXiv:2505.11718, 2025

work page arXiv 2025

[56] [56]

C. Tan, D. Lyu, S. Li, Z. Gao, J. Wei, S. Ma, Z. Liu, and S. Z. Li. Peer review as a multi-turn and long-context dialogue with role-based interactions.arXiv preprint arXiv:2406.05688, 2024

work page arXiv 2024

[57] [57]

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[58] [58]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[59] [59]

Tyser, B

K. Tyser, B. Segev, G. Longhitano, X.-Y. Zhang, Z. Meeks, J. Lee, U. Garg, N. Belsten, A. Shporer, M. Udell, et al. Ai-driven review systems: evaluating llms in scalable and bias-aware academic reviews.arXiv preprint arXiv:2408.10365, 2024

work page arXiv 2024

[60] [60]

Wang and Y

Q. Wang and Y. Tan. Grammatical error detection with self attention by pairwise training. In2020 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2020

2020

[61] [61]

Q. Wang, Q. Zeng, L. Huang, K. Knight, H. Ji, and N. F. Rajani. ReviewRobot: Explainable paper review generation based on knowledge synthesis. InProceed- ings of the 13th International Conference on Natural Language Generation, pages 384–397, 2020

2020

[62] [62]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022

[63] [63]

Y. Weng, M. Zhu, G. Bao, H. Zhang, J. Wang, Y. Zhang, and L. Yang. Cyclere- searcher: Improving automated research via automated review. 2025

2025

[64] [64]

C. J. Willmott and K. Matsuura. Advantages of the mean absolute error (mae) over the root mean square error (rmse) in assessing average model performance. Climate research, 30(1):79–82, 2005

2005

[65] [65]

R. Ye, X. Pang, J. Chai, J. Chen, Z. Yin, Z. Xiang, X. Dong, J. Shao, and S. Chen. Are we there yet? revealing the risks of utilizing large language models in scholarly peer review.arXiv preprint arXiv:2412.01708, 2024

work page arXiv 2024

[66] [66]

J. Yu, Z. Ding, J. Tan, K. Luo, Z. Weng, C. Gong, L. Zeng, R. Cui, C. Han, Q. Sun, et al. Automated peer reviewing in paper sea: Standardization, evaluation, and analysis.arXiv preprint arXiv:2407.12857, 2024

work page arXiv 2024

[67] [67]

J. Yu, Z. Ding, J. Tan, K. Luo, Z. Weng, C. Gong, L. Zeng, R. Cui, C. Han, Q. Sun, Z. Wu, Y. Lan, and X. Li. Automated peer reviewing in paper SEA: Standardiza- tion, evaluation, and analysis. InFindings of the Association for Computational Linguistics: EMNLP 2024, 2024

2024

[68] [68]

S. Yu, M. Luo, A. Madasu, V. Lal, and P. Howard. Is your paper being reviewed by an llm? investigating ai text detectability in peer review.arXiv preprint arXiv:2410.03019, 2024

work page arXiv 2024

[69] [69]

Yuan and P

W. Yuan and P. Liu. Kid-review: knowledge-guided scientific review generation with oracle pre-training. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11639–11647, 2022

2022

[70] [70]

W. Yuan, P. Liu, and G. Neubig. Can we automate scientific reviewing?Journal of Artificial Intelligence Research, 75:171–212, 2022

2022

[71] [71]

Zhang, Z

D. Zhang, Z. Bao, S. Du, Z. Zhao, K. Zhang, D. Bao, and Y. Yang. Re2: A consistency- ensured dataset for full-stage peer review and multi-turn rebuttal discussions. arXiv preprint arXiv:2505.07920, 2025

work page arXiv 2025

[72] [72]

BERTScore: Evaluating Text Generation with BERT

T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. Bertscore: Evaluating text generation with bert.arXiv preprint arXiv:1904.09675, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904

[73] [73]

W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger. Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 563–578, 2019

2019

[74] [74]

Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh. Calibrate before use: Improving few-shot performance of language models. InInternational conference on machine learning, pages 12697–12706. PMLR, 2021

2021

[75] [75]

R. Zhou, L. Chen, and K. Yu. Is llm a reliable reviewer? a comprehensive evaluation of llm on automatic paper reviewing tasks. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9340–9351, 2024

2024

[76] [76]

C. Zhu, J. Xiong, R. Ma, Z. Lu, Y. Liu, and L. Li. When your reviewer is an llm: Biases, divergence, and prompt injection risks in peer review.arXiv preprint arXiv:2509.09912, 2025

work page arXiv 2025

[77] [77]

M. Zhu, Y. Weng, L. Yang, and Y. Zhang. Deepreview: Improving llm-based paper review with human-like deep thinking process.arXiv preprint arXiv:2503.08569, 2025

work page arXiv 2025

[78] [78]

Zhuang, J

Z. Zhuang, J. Chen, H. Xu, Y. Jiang, and J. Lin. Large language models for automated scholarly paper review: A survey.arXiv preprint arXiv:2501.10326, 2025. Received 23 June 2026

work page arXiv 2025