pith. sign in

arxiv: 2606.25057 · v1 · pith:PMDR7RJInew · submitted 2026-06-23 · 💻 cs.CL

LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges

Pith reviewed 2026-06-25 23:37 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM-based peer reviewcritique generationscore predictionrobustness risksprompt injectionscientific evaluationAI-assisted reviewreliability challenges
0
0 comments X

The pith

LLMs can generate fluent peer-review critiques and approximate scores, but their reliability, robustness, and security as decision-support tools remain insufficiently understood.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey maps how large language models are applied to two main peer-review tasks: producing written critiques and predicting numeric scores. It organizes methods into prompt-based, supervised, retrieval-augmented, and alignment-optimized categories, then compares them against existing benchmarks. The analysis flags dataset limits, domain biases, and evaluation gaps that make current results hard to generalize. It further identifies concrete manipulation threats such as prompt injection, data poisoning, and reward hacking that could undermine automated pipelines. A reader would care because peer review underpins scientific quality; if these gaps persist, scaling review with LLMs risks introducing new forms of error or bias rather than solving the volume problem.

Core claim

The paper claims that although LLMs produce fluent critiques and approximate reviewer scores in current studies, their reliability, robustness, and security as decision-support systems remain insufficiently understood. It delivers a taxonomy of modeling approaches and synthesizes findings across benchmarks while highlighting dataset constraints, domain concentration biases, and emerging robustness risks including prompt injection, data poisoning, retrieval vulnerabilities, and reward hacking. From a data-mining viewpoint it flags open challenges in modeling subjective disagreement and cross-domain generalization, reframing automated peer review as a high-stakes multi-objective decision probl

What carries the argument

Structured taxonomy of modeling approaches for critique generation and score prediction, covering prompt-based, supervised, retrieval-augmented, and alignment-optimized methods.

If this is right

  • Existing benchmarks suffer from dataset constraints and domain concentration biases that limit assessment of generalization.
  • Automated review pipelines are exposed to strategic manipulation via prompt injection, data poisoning, retrieval vulnerabilities, and reward hacking.
  • Key open challenges remain in modeling subjective disagreement among reviewers and achieving cross-domain generalization.
  • Reframing peer review as a high-stakes multi-objective decision problem is required to develop trustworthy AI-assisted evaluation systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Without explicit security measures, widespread adoption of LLM review could allow bad actors to influence which papers are accepted.
  • Hybrid systems that combine LLM drafts with human oversight and adversarial testing may be needed before deployment at scale.
  • Live experiments that insert controlled attacks into real review workflows could quantify the practical size of the identified risks.

Load-bearing premise

The body of existing studies and benchmarks provides a sufficiently representative sample to support a comprehensive taxonomy, identification of domain biases, and conclusions about robustness risks across the field.

What would settle it

A controlled study that measures LLM review outputs against human reviewers across multiple scientific domains and finds consistently high reliability with no successful prompt-injection or data-poisoning attacks would contradict the claim of insufficient understanding.

Figures

Figures reproduced from arXiv: 2606.25057 by Thi Huyen Nguyen, Zahra Ahmadi.

Figure 1
Figure 1. Figure 1: Annual submission counts for three conferences in [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Peer review pipeline. LLM-based systems are increasingly being investigated for automat￾ing two fundamental components of peer review reports: (1) textual critique generation and (2) quantitative score prediction. These two components are central to editorial and program-committee decision-making. Critiques articulate structured assessments of a manuscript’s strengths and weaknesses, while scores translate… view at source ↗
Figure 3
Figure 3. Figure 3: Fraction of reviews detected as LLM-generated by [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Taxonomy of automated peer review generation, categorized by different aspects. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
read the original abstract

The rapid growth of scientific submissions has pushed traditional peer review toward its scalability limits, motivating the exploration of large language models (LLMs) as intelligent automated evaluation assistants. Although recent studies show that LLMs can generate fluent critiques and approximate reviewer scores, their reliability, robustness, and security as decision-support systems remain insufficiently understood. This survey offers a systems-level analysis of LLM-based scientific peer review, focusing on two core evaluative functions: critique generation and score prediction. We present a structured taxonomy of modeling approaches (including prompt-based, supervised, retrieval-augmented, and alignment-optimized approaches), and synthesize empirical findings across existing benchmarks. We analyze dataset constraints, evaluation shortcomings, and domain concentration biases that limit current assessment practices. Beyond performance metrics, we identify emerging robustness risks, including prompt injection, data poisoning, retrieval vulnerabilities, and reward hacking, which expose automated review pipelines to strategic manipulation. From a data mining perspective, we outline key open challenges in modeling subjective disagreement and cross-domain generalization. By reframing automated peer review as a high-stakes, multi-objective decision problem, this survey provides a roadmap for developing robust, transparent, and trustworthy AI-assisted scientific evaluation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper is a survey on LLM-based scientific peer review, focusing on critique generation and score prediction. It presents a taxonomy of approaches (prompt-based, supervised, retrieval-augmented, alignment-optimized), synthesizes empirical findings from benchmarks, analyzes dataset constraints, evaluation shortcomings, and domain concentration biases, identifies robustness risks (prompt injection, data poisoning, retrieval vulnerabilities, reward hacking), discusses challenges in modeling subjective disagreement and cross-domain generalization from a data-mining perspective, and provides a roadmap for robust AI-assisted evaluation systems.

Significance. If the underlying literature synthesis holds, the survey makes a timely contribution by framing automated peer review as a high-stakes multi-objective problem and cataloging specific risks and limitations that current benchmarks fail to address. It gives credit to the structured taxonomy and the explicit call-out of manipulation vectors (e.g., prompt injection) that expose decision-support pipelines. The work could usefully guide future benchmark design, though its influence depends on the representativeness of the reviewed studies.

major comments (2)
  1. [Abstract / survey methodology] Abstract and survey methodology section: no search strategy, inclusion criteria, database sources, or quantitative synthesis protocol (e.g., PRISMA-style) are described. This is load-bearing for the central claims that the taxonomy captures 'domain concentration biases' and that robustness risks are 'insufficiently understood' across the field; without these details the representativeness of the selected benchmarks cannot be assessed and the synthesis risks over-generalizing from CS/NLP-centric studies.
  2. [Empirical findings synthesis] Section synthesizing empirical findings: the claims about cross-domain generalization failures and pipeline-specific manipulation risks rest on the same unverified sampling assumption. If the reviewed benchmarks are skewed toward narrow review formats or particular venues, the identified 'domain concentration biases' and the risk catalog may not generalize, weakening the roadmap for trustworthy systems.
minor comments (2)
  1. [Abstract] The abstract states that LLMs 'approximate reviewer scores' but does not clarify whether this refers to correlation with human scores, ranking accuracy, or other metrics; a brief definition would improve precision.
  2. [Robustness risks] Several risk categories (prompt injection, data poisoning) are listed without citing the specific peer-review benchmarks or studies that demonstrate them in this domain; adding those references would strengthen the synthesis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on methodological transparency and the generalizability of our synthesis. We address each major comment below and will make targeted revisions to improve clarity without altering the core contributions.

read point-by-point responses
  1. Referee: [Abstract / survey methodology] Abstract and survey methodology section: no search strategy, inclusion criteria, database sources, or quantitative synthesis protocol (e.g., PRISMA-style) are described. This is load-bearing for the central claims that the taxonomy captures 'domain concentration biases' and that robustness risks are 'insufficiently understood' across the field; without these details the representativeness of the selected benchmarks cannot be assessed and the synthesis risks over-generalizing from CS/NLP-centric studies.

    Authors: We agree this detail is needed for assessing representativeness. In revision we will insert a dedicated 'Survey Methodology' subsection describing the literature search process: databases queried (arXiv, ACL Anthology, Semantic Scholar, Google Scholar), search keywords and date range (post-2022 papers on LLM peer review), inclusion criteria (focus on critique generation or score prediction with empirical evaluation), and exclusion criteria. We will also note that this is a narrative survey of an emerging area rather than a PRISMA-compliant systematic review, and will add an explicit limitations paragraph on potential coverage gaps. These additions will directly support the domain-bias claims by making the sampling basis transparent. revision: yes

  2. Referee: [Empirical findings synthesis] Section synthesizing empirical findings: the claims about cross-domain generalization failures and pipeline-specific manipulation risks rest on the same unverified sampling assumption. If the reviewed benchmarks are skewed toward narrow review formats or particular venues, the identified 'domain concentration biases' and the risk catalog may not generalize, weakening the roadmap for trustworthy systems.

    Authors: We accept the point that claims must be qualified by the underlying sample. We will revise the empirical synthesis and roadmap sections to (1) tabulate the venues and domains of the reviewed benchmarks, (2) explicitly state that observed generalization failures and manipulation vectors are drawn primarily from CS/NLP studies, and (3) add a forward-looking paragraph recommending construction of cross-domain benchmarks. This keeps the risk catalog intact while preventing over-generalization. revision: yes

Circularity Check

0 steps flagged

No circularity: survey synthesizes external benchmarks without self-referential reductions

full rationale

This is a survey paper presenting a taxonomy and synthesis of existing LLM peer-review studies and benchmarks. No derivations, equations, fitted parameters, or predictions appear that reduce by construction to the authors' own inputs. Claims rest on analysis of external literature; the representativeness concern raised in the skeptic note is a sampling/correctness issue, not a circularity reduction. No self-citation chains or ansatzes are load-bearing in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The survey rests on the domain assumption that existing empirical studies on LLM critique generation and score prediction are adequate to identify general patterns, biases, and risks; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption The rapid growth of scientific submissions has pushed traditional peer review toward its scalability limits
    Opening motivation stated in the abstract.
  • domain assumption LLMs can be evaluated on two core functions: critique generation and score prediction
    Central framing of the survey scope in the abstract.

pith-pipeline@v0.9.1-grok · 5735 in / 1272 out tokens · 28658 ms · 2026-06-25T23:37:54.442576+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

78 extracted references · 27 canonical work pages · 6 internal anchors

  1. [1]

    Aaai launches ai-powered peer review assessment system

    AAAI. Aaai launches ai-powered peer review assessment system. https://aaai.org/ aaai-launches-ai-powered-peer-review-assessment-system/, 2025. Published: 2025-05-16

  2. [2]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Banerjee and A

    S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005

  4. [4]

    P. K. Bharti, V. Dalal, and M. Panchal. Co-reviewer: can ai review like a human? an agentic framework for llm-human alignment in peer review.Scientometrics, pages 1–42, 2026

  5. [5]

    P. K. Bharti, S. Ranjan, T. Ghosal, M. Agrawal, and A. Ekbal. Peerassist: leveraging on paper-review interactions to predict peer review decisions. InTowards Open and Trustworthy Digital Societies: 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings 23, pages 421–435. Springer, 2021

  6. [6]

    Biggio and F

    B. Biggio and F. Roli. Wild patterns: Ten years after the rise of adversarial machine learning. InProceedings of the 2018 ACM SIGSAC conference on computer and communications security, pages 2154–2156, 2018

  7. [7]

    Biswas, D

    S. Biswas, D. Dobaria, and H. L. Cohen. Chatgpt and the future of journal reviews: a feasibility study.The Yale Journal of Biology and Medicine, 96(3):415, 2023

  8. [8]

    Bornmann

    L. Bornmann. Scientific peer review.Annual review of information science and technology, 45(1):197–245, 2011

  9. [9]

    Chai and R

    T. Chai and R. R. Draxler. Root mean square error (rmse) or mean absolute error (mae)?–arguments against avoiding rmse in the literature.Geoscientific model development, 7(3):1247–1250, 2014

  10. [10]

    S. Chen, D. Brumby, and A. Cox. Envisioning the future of peer review: Inves- tigating llm-assisted reviewing using chatgpt as a case study. InProceedings of the 4th Annual Symposium on Human-Computer Interaction for Work, pages 1–18, 2025

  11. [11]

    D’Arcy, T

    M. D’Arcy, T. Hope, L. Birnbaum, and D. Downey. Marg: Multi-agent review generation for scientific papers.arXiv preprint arXiv:2401.04259, 2024

  12. [12]

    J. Du, Y. Wang, W. Zhao, Z. Deng, S. Liu, R. Lou, H. P. Zou, P. N. Venkit, N. Zhang, M. Srinath, et al. Llms assist nlp researchers: Critique paper (meta-) reviewing. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  13. [13]

    Dycke, I

    N. Dycke, I. Kuznetsov, and I. Gurevych. Nlpeer: A unified resource for the computational study of peer review. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

  14. [14]

    A. R. B. M. Faizullah, A. Urlana, and R. Mishra. Limgen: Probing the llms for generating suggestive limitations of research papers. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 106–124. Springer, 2024

  15. [15]

    X. Gao, J. Ruan, J. Gao, T. Liu, and Y. Fu. Reviewagents: Bridging the gap between human and ai-generated paper reviews.arXiv preprint arXiv:2503.08506, 2025

  16. [16]

    Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, H. Wang, H. Wang, et al. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2(1):32, 2023

  17. [17]

    Z. Gao, K. Brantley, and T. Joachims. Reviewer2: Optimizing review generation through prompt generation.arXiv preprint arXiv:2402.10886, 2024

  18. [18]

    Hosseini and S

    M. Hosseini and S. P. Horbach. Fighting reviewer fatigue or amplifying bias? considerations and recommendations for use of chatgpt and other large language models in scholarly peer review.Research integrity and peer review, 8(1):4, 2023

  19. [19]

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  20. [20]

    X. Hua, M. Nikolov, N. Badugu, and L. Wang. Argument mining for understanding peer reviews. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics), 2019

  21. [21]

    Huang, W

    L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

  22. [22]

    Leveraging llm feedback to enhance review quality

    ICLR. Leveraging llm feedback to enhance review quality. https://blog.iclr.cc/ 2025/04/15/leveraging-llm-feedback-to-enhance-review-quality/, 2025. Pub- lished: 2025-04-15

  23. [23]

    Idahl and Z

    M. Idahl and Z. Ahmadi. Openreviewer: A specialized large language model for generating critical scientific paper reviews. InProceedings of the 2025 Confer- ence of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025

  24. [24]

    Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation.ACM computing surveys, 55(12):1–38, 2023

  25. [25]

    Y. Jin, Q. Zhao, Y. Wang, H. Chen, K. Zhu, Y. Xiao, and J. Wang. Agentreview: Exploring peer review dynamics with llm agents. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  26. [26]

    D. Kang, W. Ammar, B. Dalvi, M. Van Zuylen, S. Kohlmeier, E. Hovy, and R. Schwartz. A dataset of peer reviews (peerread): Collection, insights and nlp applications. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018

  27. [27]

    J. Keuper. Prompt injection attacks on llm generated reviews of scientific publi- cations.arXiv preprint arXiv:2509.10248, 2025

  28. [28]

    Kojima, S

    T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

  29. [29]

    Künzli, A

    N. Künzli, A. Berger, K. Czabanowska, R. Lucas, A. Madarasova Geckova, S. Mantwill, and O. von Dem Knesebeck. «i do not have time»—is this the end of peer review in public health sciences?Public health reviews, 43:1605407, 2022

  30. [30]

    Kuznetsov, O

    I. Kuznetsov, O. M. Afzal, K. Dercksen, N. Dycke, A. Goldberg, T. Hope, D. Hovy, J. K. Kummerfeld, A. Lauscher, K. Leyton-Brown, et al. What can natural language processing do for peer review?arXiv preprint arXiv:2405.06563, 2024

  31. [31]

    G. R. Latona, M. H. Ribeiro, T. R. Davidson, V. Veselovsky, and R. West. The ai review lottery: Widespread ai-assisted peer reviews boost paper scores and acceptance rates.arXiv preprint arXiv:2405.02150, 2024

  32. [32]

    Lewis, E

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

  33. [33]

    J. Li, W. X. Zhao, J.-R. Wen, and Y. Song. Generating long and informative reviews with aspect-aware coarse-to-fine decoding. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1969–1979, 2019

  34. [34]

    Liang, Y

    W. Liang, Y. Zhang, H. Cao, B. Wang, D. Y. Ding, X. Yang, K. Vodrahalli, S. He, D. S. Smith, Y. Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis.NEJM AI, 1(8):AIoa2400196, 2024

  35. [35]

    C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

  36. [36]

    J. Lin, J. Song, Z. Zhou, Y. Chen, and X. Shi. Moprd: A multidisciplinary open peer review dataset.Neural Computing and Applications, 35(34):24191–24206, 2023

  37. [37]

    P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.ACM computing surveys, 55(9):1–35, 2023

  38. [38]

    Z. Luo, Z. Yang, Z. Xu, W. Yang, and X. Du. Llm4sr: A survey on large language models for scientific research.arXiv preprint arXiv:2501.04306, 2025

  39. [39]

    Markhasin

    E. Markhasin. Ai-driven scholarly peer review via persistent workflow prompting, meta-prompting, and meta-reasoning.arXiv preprint arXiv:2505.03332, 2025

  40. [40]

    S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettle- moyer. Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

  41. [41]

    Are We Truly Innovating? A Qualitative and Quantitative Study of Originality in AI Research Papers

    A. Mostafa, T. H. Nguyen, and Z. Ahmadi. What is novel? a knowledge- driven framework for bias-aware literature originality evaluation.arXiv preprint arXiv:2602.06054, 2026

  42. [42]

    Mulligan, L

    A. Mulligan, L. Hall, and E. Raphael. Peer review in a changing world: An inter- national study measuring the attitudes of researchers.Journal of the American Society for Information Science and Technology, 64(1):132–161, 2013

  43. [43]

    Chatgpt (mar 14 version)

    OpenAI. Chatgpt (mar 14 version). https://chat.openai.com, 2023. Accessed: 2025-04-29

  44. [44]

    Issues of ai and academic transparency

    OPUS Project Consortium. Issues of ai and academic transparency. https://opusproject.eu/openscience-news/issues-of-ai-and-academic- transparency/?utm_source=chatgpt.com, 2024. published: 2024-05-03

  45. [45]

    Paper copilot statistics

    Paper Copilot. Paper copilot statistics. https://papercopilot.com/statistics/, 2026

  46. [46]

    Papineni, S

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  47. [47]

    Pataranutaporn, N

    P. Pataranutaporn, N. Powdthavee, and P. Maes. Can ai solve the peer review crisis? a large scale experiment on llm’s performance and biases in evaluating ACM SIGKDD Explorations Newsletter, June 2026, Woodstock, NY Nguyen and Ahmadi economics papers.arXiv preprint arXiv:2502.00070, 2025

  48. [48]

    E. L. Pier, M. Brauer, A. Filut, A. Kaatz, J. Raclaw, M. J. Nathan, C. E. Ford, and M. Carnes. Low agreement among reviewers evaluating the same nih grant applications.Proceedings of the National Academy of Sciences, 115(12):2952–2957, 2018

  49. [49]

    Price and P

    S. Price and P. A. Flach. Computational support for academic peer review: A perspective from artificial intelligence.Communications of the ACM, 60(3):70–79, 2017

  50. [50]

    Robertson

    Z. Robertson. Gpt4 is slightly helpful for peer-review assistance: A pilot study. arXiv preprint arXiv:2307.05492, 2023

  51. [51]

    A. Saad, N. Jenko, S. Ariyaratne, N. Birch, K. P. Iyengar, A. M. Davies, R. Vaishya, and R. Botchu. Exploring the potential of chatgpt in the peer review process: an observational study.Diabetes & Metabolic Syndrome: Clinical Research & Reviews, 18(2):102946, 2024

  52. [52]

    Stahl, L

    M. Stahl, L. Biermann, A. Nehring, and H. Wachsmuth. Exploring llm prompting strategies for joint essay scoring and feedback generation. InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), 2024

  53. [53]

    Stappen, G

    L. Stappen, G. Rizos, M. Hasan, T. Hain, and B. W. Schuller. Uncertainty-aware machine support for paper reviewing on the interspeech 2019 submission corpus. In21st Annual Conference of the International Speech Communication Association, 2020

  54. [54]

    Sukpanichnant, A

    P. Sukpanichnant, A. Rapberger, and F. Toni. Peerarg: Argumentative peer review with llms.arXiv preprint arXiv:2409.16813, 2024

  55. [55]

    Taechoyotin and D

    P. Taechoyotin and D. Acuna. Remor: Automated peer review generation with llm reasoning and multi-objective reinforcement learning.arXiv preprint arXiv:2505.11718, 2025

  56. [56]

    C. Tan, D. Lyu, S. Li, Z. Gao, J. Wei, S. Ma, Z. Liu, and S. Z. Li. Peer review as a multi-turn and long-context dialogue with role-based interactions.arXiv preprint arXiv:2406.05688, 2024

  57. [57]

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  58. [58]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

  59. [59]

    Tyser, B

    K. Tyser, B. Segev, G. Longhitano, X.-Y. Zhang, Z. Meeks, J. Lee, U. Garg, N. Belsten, A. Shporer, M. Udell, et al. Ai-driven review systems: evaluating llms in scalable and bias-aware academic reviews.arXiv preprint arXiv:2408.10365, 2024

  60. [60]

    Wang and Y

    Q. Wang and Y. Tan. Grammatical error detection with self attention by pairwise training. In2020 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2020

  61. [61]

    Q. Wang, Q. Zeng, L. Huang, K. Knight, H. Ji, and N. F. Rajani. ReviewRobot: Explainable paper review generation based on knowledge synthesis. InProceed- ings of the 13th International Conference on Natural Language Generation, pages 384–397, 2020

  62. [62]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  63. [63]

    Y. Weng, M. Zhu, G. Bao, H. Zhang, J. Wang, Y. Zhang, and L. Yang. Cyclere- searcher: Improving automated research via automated review. 2025

  64. [64]

    C. J. Willmott and K. Matsuura. Advantages of the mean absolute error (mae) over the root mean square error (rmse) in assessing average model performance. Climate research, 30(1):79–82, 2005

  65. [65]

    R. Ye, X. Pang, J. Chai, J. Chen, Z. Yin, Z. Xiang, X. Dong, J. Shao, and S. Chen. Are we there yet? revealing the risks of utilizing large language models in scholarly peer review.arXiv preprint arXiv:2412.01708, 2024

  66. [66]

    J. Yu, Z. Ding, J. Tan, K. Luo, Z. Weng, C. Gong, L. Zeng, R. Cui, C. Han, Q. Sun, et al. Automated peer reviewing in paper sea: Standardization, evaluation, and analysis.arXiv preprint arXiv:2407.12857, 2024

  67. [67]

    J. Yu, Z. Ding, J. Tan, K. Luo, Z. Weng, C. Gong, L. Zeng, R. Cui, C. Han, Q. Sun, Z. Wu, Y. Lan, and X. Li. Automated peer reviewing in paper SEA: Standardiza- tion, evaluation, and analysis. InFindings of the Association for Computational Linguistics: EMNLP 2024, 2024

  68. [68]

    S. Yu, M. Luo, A. Madasu, V. Lal, and P. Howard. Is your paper being reviewed by an llm? investigating ai text detectability in peer review.arXiv preprint arXiv:2410.03019, 2024

  69. [69]

    Yuan and P

    W. Yuan and P. Liu. Kid-review: knowledge-guided scientific review generation with oracle pre-training. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11639–11647, 2022

  70. [70]

    W. Yuan, P. Liu, and G. Neubig. Can we automate scientific reviewing?Journal of Artificial Intelligence Research, 75:171–212, 2022

  71. [71]

    Zhang, Z

    D. Zhang, Z. Bao, S. Du, Z. Zhao, K. Zhang, D. Bao, and Y. Yang. Re2: A consistency- ensured dataset for full-stage peer review and multi-turn rebuttal discussions. arXiv preprint arXiv:2505.07920, 2025

  72. [72]

    BERTScore: Evaluating Text Generation with BERT

    T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. Bertscore: Evaluating text generation with bert.arXiv preprint arXiv:1904.09675, 2019

  73. [73]

    W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger. Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 563–578, 2019

  74. [74]

    Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh. Calibrate before use: Improving few-shot performance of language models. InInternational conference on machine learning, pages 12697–12706. PMLR, 2021

  75. [75]

    R. Zhou, L. Chen, and K. Yu. Is llm a reliable reviewer? a comprehensive evaluation of llm on automatic paper reviewing tasks. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9340–9351, 2024

  76. [76]

    C. Zhu, J. Xiong, R. Ma, Z. Lu, Y. Liu, and L. Li. When your reviewer is an llm: Biases, divergence, and prompt injection risks in peer review.arXiv preprint arXiv:2509.09912, 2025

  77. [77]

    M. Zhu, Y. Weng, L. Yang, and Y. Zhang. Deepreview: Improving llm-based paper review with human-like deep thinking process.arXiv preprint arXiv:2503.08569, 2025

  78. [78]

    Zhuang, J

    Z. Zhuang, J. Chen, H. Xu, Y. Jiang, and J. Lin. Large language models for automated scholarly paper review: A survey.arXiv preprint arXiv:2501.10326, 2025. Received 23 June 2026