LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges
Pith reviewed 2026-06-25 23:37 UTC · model grok-4.3
The pith
LLMs can generate fluent peer-review critiques and approximate scores, but their reliability, robustness, and security as decision-support tools remain insufficiently understood.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that although LLMs produce fluent critiques and approximate reviewer scores in current studies, their reliability, robustness, and security as decision-support systems remain insufficiently understood. It delivers a taxonomy of modeling approaches and synthesizes findings across benchmarks while highlighting dataset constraints, domain concentration biases, and emerging robustness risks including prompt injection, data poisoning, retrieval vulnerabilities, and reward hacking. From a data-mining viewpoint it flags open challenges in modeling subjective disagreement and cross-domain generalization, reframing automated peer review as a high-stakes multi-objective decision probl
What carries the argument
Structured taxonomy of modeling approaches for critique generation and score prediction, covering prompt-based, supervised, retrieval-augmented, and alignment-optimized methods.
If this is right
- Existing benchmarks suffer from dataset constraints and domain concentration biases that limit assessment of generalization.
- Automated review pipelines are exposed to strategic manipulation via prompt injection, data poisoning, retrieval vulnerabilities, and reward hacking.
- Key open challenges remain in modeling subjective disagreement among reviewers and achieving cross-domain generalization.
- Reframing peer review as a high-stakes multi-objective decision problem is required to develop trustworthy AI-assisted evaluation systems.
Where Pith is reading between the lines
- Without explicit security measures, widespread adoption of LLM review could allow bad actors to influence which papers are accepted.
- Hybrid systems that combine LLM drafts with human oversight and adversarial testing may be needed before deployment at scale.
- Live experiments that insert controlled attacks into real review workflows could quantify the practical size of the identified risks.
Load-bearing premise
The body of existing studies and benchmarks provides a sufficiently representative sample to support a comprehensive taxonomy, identification of domain biases, and conclusions about robustness risks across the field.
What would settle it
A controlled study that measures LLM review outputs against human reviewers across multiple scientific domains and finds consistently high reliability with no successful prompt-injection or data-poisoning attacks would contradict the claim of insufficient understanding.
Figures
read the original abstract
The rapid growth of scientific submissions has pushed traditional peer review toward its scalability limits, motivating the exploration of large language models (LLMs) as intelligent automated evaluation assistants. Although recent studies show that LLMs can generate fluent critiques and approximate reviewer scores, their reliability, robustness, and security as decision-support systems remain insufficiently understood. This survey offers a systems-level analysis of LLM-based scientific peer review, focusing on two core evaluative functions: critique generation and score prediction. We present a structured taxonomy of modeling approaches (including prompt-based, supervised, retrieval-augmented, and alignment-optimized approaches), and synthesize empirical findings across existing benchmarks. We analyze dataset constraints, evaluation shortcomings, and domain concentration biases that limit current assessment practices. Beyond performance metrics, we identify emerging robustness risks, including prompt injection, data poisoning, retrieval vulnerabilities, and reward hacking, which expose automated review pipelines to strategic manipulation. From a data mining perspective, we outline key open challenges in modeling subjective disagreement and cross-domain generalization. By reframing automated peer review as a high-stakes, multi-objective decision problem, this survey provides a roadmap for developing robust, transparent, and trustworthy AI-assisted scientific evaluation systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a survey on LLM-based scientific peer review, focusing on critique generation and score prediction. It presents a taxonomy of approaches (prompt-based, supervised, retrieval-augmented, alignment-optimized), synthesizes empirical findings from benchmarks, analyzes dataset constraints, evaluation shortcomings, and domain concentration biases, identifies robustness risks (prompt injection, data poisoning, retrieval vulnerabilities, reward hacking), discusses challenges in modeling subjective disagreement and cross-domain generalization from a data-mining perspective, and provides a roadmap for robust AI-assisted evaluation systems.
Significance. If the underlying literature synthesis holds, the survey makes a timely contribution by framing automated peer review as a high-stakes multi-objective problem and cataloging specific risks and limitations that current benchmarks fail to address. It gives credit to the structured taxonomy and the explicit call-out of manipulation vectors (e.g., prompt injection) that expose decision-support pipelines. The work could usefully guide future benchmark design, though its influence depends on the representativeness of the reviewed studies.
major comments (2)
- [Abstract / survey methodology] Abstract and survey methodology section: no search strategy, inclusion criteria, database sources, or quantitative synthesis protocol (e.g., PRISMA-style) are described. This is load-bearing for the central claims that the taxonomy captures 'domain concentration biases' and that robustness risks are 'insufficiently understood' across the field; without these details the representativeness of the selected benchmarks cannot be assessed and the synthesis risks over-generalizing from CS/NLP-centric studies.
- [Empirical findings synthesis] Section synthesizing empirical findings: the claims about cross-domain generalization failures and pipeline-specific manipulation risks rest on the same unverified sampling assumption. If the reviewed benchmarks are skewed toward narrow review formats or particular venues, the identified 'domain concentration biases' and the risk catalog may not generalize, weakening the roadmap for trustworthy systems.
minor comments (2)
- [Abstract] The abstract states that LLMs 'approximate reviewer scores' but does not clarify whether this refers to correlation with human scores, ranking accuracy, or other metrics; a brief definition would improve precision.
- [Robustness risks] Several risk categories (prompt injection, data poisoning) are listed without citing the specific peer-review benchmarks or studies that demonstrate them in this domain; adding those references would strengthen the synthesis.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on methodological transparency and the generalizability of our synthesis. We address each major comment below and will make targeted revisions to improve clarity without altering the core contributions.
read point-by-point responses
-
Referee: [Abstract / survey methodology] Abstract and survey methodology section: no search strategy, inclusion criteria, database sources, or quantitative synthesis protocol (e.g., PRISMA-style) are described. This is load-bearing for the central claims that the taxonomy captures 'domain concentration biases' and that robustness risks are 'insufficiently understood' across the field; without these details the representativeness of the selected benchmarks cannot be assessed and the synthesis risks over-generalizing from CS/NLP-centric studies.
Authors: We agree this detail is needed for assessing representativeness. In revision we will insert a dedicated 'Survey Methodology' subsection describing the literature search process: databases queried (arXiv, ACL Anthology, Semantic Scholar, Google Scholar), search keywords and date range (post-2022 papers on LLM peer review), inclusion criteria (focus on critique generation or score prediction with empirical evaluation), and exclusion criteria. We will also note that this is a narrative survey of an emerging area rather than a PRISMA-compliant systematic review, and will add an explicit limitations paragraph on potential coverage gaps. These additions will directly support the domain-bias claims by making the sampling basis transparent. revision: yes
-
Referee: [Empirical findings synthesis] Section synthesizing empirical findings: the claims about cross-domain generalization failures and pipeline-specific manipulation risks rest on the same unverified sampling assumption. If the reviewed benchmarks are skewed toward narrow review formats or particular venues, the identified 'domain concentration biases' and the risk catalog may not generalize, weakening the roadmap for trustworthy systems.
Authors: We accept the point that claims must be qualified by the underlying sample. We will revise the empirical synthesis and roadmap sections to (1) tabulate the venues and domains of the reviewed benchmarks, (2) explicitly state that observed generalization failures and manipulation vectors are drawn primarily from CS/NLP studies, and (3) add a forward-looking paragraph recommending construction of cross-domain benchmarks. This keeps the risk catalog intact while preventing over-generalization. revision: yes
Circularity Check
No circularity: survey synthesizes external benchmarks without self-referential reductions
full rationale
This is a survey paper presenting a taxonomy and synthesis of existing LLM peer-review studies and benchmarks. No derivations, equations, fitted parameters, or predictions appear that reduce by construction to the authors' own inputs. Claims rest on analysis of external literature; the representativeness concern raised in the skeptic note is a sampling/correctness issue, not a circularity reduction. No self-citation chains or ansatzes are load-bearing in the provided text.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The rapid growth of scientific submissions has pushed traditional peer review toward its scalability limits
- domain assumption LLMs can be evaluated on two core functions: critique generation and score prediction
Reference graph
Works this paper leans on
-
[1]
Aaai launches ai-powered peer review assessment system
AAAI. Aaai launches ai-powered peer review assessment system. https://aaai.org/ aaai-launches-ai-powered-peer-review-assessment-system/, 2025. Published: 2025-05-16
2025
-
[2]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Banerjee and A
S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005
2005
-
[4]
P. K. Bharti, V. Dalal, and M. Panchal. Co-reviewer: can ai review like a human? an agentic framework for llm-human alignment in peer review.Scientometrics, pages 1–42, 2026
2026
-
[5]
P. K. Bharti, S. Ranjan, T. Ghosal, M. Agrawal, and A. Ekbal. Peerassist: leveraging on paper-review interactions to predict peer review decisions. InTowards Open and Trustworthy Digital Societies: 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings 23, pages 421–435. Springer, 2021
2021
-
[6]
Biggio and F
B. Biggio and F. Roli. Wild patterns: Ten years after the rise of adversarial machine learning. InProceedings of the 2018 ACM SIGSAC conference on computer and communications security, pages 2154–2156, 2018
2018
-
[7]
Biswas, D
S. Biswas, D. Dobaria, and H. L. Cohen. Chatgpt and the future of journal reviews: a feasibility study.The Yale Journal of Biology and Medicine, 96(3):415, 2023
2023
-
[8]
Bornmann
L. Bornmann. Scientific peer review.Annual review of information science and technology, 45(1):197–245, 2011
2011
-
[9]
Chai and R
T. Chai and R. R. Draxler. Root mean square error (rmse) or mean absolute error (mae)?–arguments against avoiding rmse in the literature.Geoscientific model development, 7(3):1247–1250, 2014
2014
-
[10]
S. Chen, D. Brumby, and A. Cox. Envisioning the future of peer review: Inves- tigating llm-assisted reviewing using chatgpt as a case study. InProceedings of the 4th Annual Symposium on Human-Computer Interaction for Work, pages 1–18, 2025
2025
- [11]
-
[12]
J. Du, Y. Wang, W. Zhao, Z. Deng, S. Liu, R. Lou, H. P. Zou, P. N. Venkit, N. Zhang, M. Srinath, et al. Llms assist nlp researchers: Critique paper (meta-) reviewing. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
2024
-
[13]
Dycke, I
N. Dycke, I. Kuznetsov, and I. Gurevych. Nlpeer: A unified resource for the computational study of peer review. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023
2023
-
[14]
A. R. B. M. Faizullah, A. Urlana, and R. Mishra. Limgen: Probing the llms for generating suggestive limitations of research papers. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 106–124. Springer, 2024
2024
- [15]
-
[16]
Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, H. Wang, H. Wang, et al. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2(1):32, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [17]
-
[18]
Hosseini and S
M. Hosseini and S. P. Horbach. Fighting reviewer fatigue or amplifying bias? considerations and recommendations for use of chatgpt and other large language models in scholarly peer review.Research integrity and peer review, 8(1):4, 2023
2023
-
[19]
E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022
2022
-
[20]
X. Hua, M. Nikolov, N. Badugu, and L. Wang. Argument mining for understanding peer reviews. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics), 2019
2019
-
[21]
Huang, W
L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025
2025
-
[22]
Leveraging llm feedback to enhance review quality
ICLR. Leveraging llm feedback to enhance review quality. https://blog.iclr.cc/ 2025/04/15/leveraging-llm-feedback-to-enhance-review-quality/, 2025. Pub- lished: 2025-04-15
2025
-
[23]
Idahl and Z
M. Idahl and Z. Ahmadi. Openreviewer: A specialized large language model for generating critical scientific paper reviews. InProceedings of the 2025 Confer- ence of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025
2025
-
[24]
Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation.ACM computing surveys, 55(12):1–38, 2023
2023
-
[25]
Y. Jin, Q. Zhao, Y. Wang, H. Chen, K. Zhu, Y. Xiao, and J. Wang. Agentreview: Exploring peer review dynamics with llm agents. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
2024
-
[26]
D. Kang, W. Ammar, B. Dalvi, M. Van Zuylen, S. Kohlmeier, E. Hovy, and R. Schwartz. A dataset of peer reviews (peerread): Collection, insights and nlp applications. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018
2018
- [27]
-
[28]
Kojima, S
T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022
2022
-
[29]
Künzli, A
N. Künzli, A. Berger, K. Czabanowska, R. Lucas, A. Madarasova Geckova, S. Mantwill, and O. von Dem Knesebeck. «i do not have time»—is this the end of peer review in public health sciences?Public health reviews, 43:1605407, 2022
2022
-
[30]
I. Kuznetsov, O. M. Afzal, K. Dercksen, N. Dycke, A. Goldberg, T. Hope, D. Hovy, J. K. Kummerfeld, A. Lauscher, K. Leyton-Brown, et al. What can natural language processing do for peer review?arXiv preprint arXiv:2405.06563, 2024
- [31]
-
[32]
Lewis, E
P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020
2020
-
[33]
J. Li, W. X. Zhao, J.-R. Wen, and Y. Song. Generating long and informative reviews with aspect-aware coarse-to-fine decoding. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1969–1979, 2019
1969
-
[34]
Liang, Y
W. Liang, Y. Zhang, H. Cao, B. Wang, D. Y. Ding, X. Yang, K. Vodrahalli, S. He, D. S. Smith, Y. Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis.NEJM AI, 1(8):AIoa2400196, 2024
2024
-
[35]
C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004
2004
-
[36]
J. Lin, J. Song, Z. Zhou, Y. Chen, and X. Shi. Moprd: A multidisciplinary open peer review dataset.Neural Computing and Applications, 35(34):24191–24206, 2023
2023
-
[37]
P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.ACM computing surveys, 55(9):1–35, 2023
2023
- [38]
- [39]
-
[40]
S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettle- moyer. Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022
2022
-
[41]
Are We Truly Innovating? A Qualitative and Quantitative Study of Originality in AI Research Papers
A. Mostafa, T. H. Nguyen, and Z. Ahmadi. What is novel? a knowledge- driven framework for bias-aware literature originality evaluation.arXiv preprint arXiv:2602.06054, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[42]
Mulligan, L
A. Mulligan, L. Hall, and E. Raphael. Peer review in a changing world: An inter- national study measuring the attitudes of researchers.Journal of the American Society for Information Science and Technology, 64(1):132–161, 2013
2013
-
[43]
Chatgpt (mar 14 version)
OpenAI. Chatgpt (mar 14 version). https://chat.openai.com, 2023. Accessed: 2025-04-29
2023
-
[44]
Issues of ai and academic transparency
OPUS Project Consortium. Issues of ai and academic transparency. https://opusproject.eu/openscience-news/issues-of-ai-and-academic- transparency/?utm_source=chatgpt.com, 2024. published: 2024-05-03
2024
-
[45]
Paper copilot statistics
Paper Copilot. Paper copilot statistics. https://papercopilot.com/statistics/, 2026
2026
-
[46]
Papineni, S
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002
2002
-
[47]
P. Pataranutaporn, N. Powdthavee, and P. Maes. Can ai solve the peer review crisis? a large scale experiment on llm’s performance and biases in evaluating ACM SIGKDD Explorations Newsletter, June 2026, Woodstock, NY Nguyen and Ahmadi economics papers.arXiv preprint arXiv:2502.00070, 2025
-
[48]
E. L. Pier, M. Brauer, A. Filut, A. Kaatz, J. Raclaw, M. J. Nathan, C. E. Ford, and M. Carnes. Low agreement among reviewers evaluating the same nih grant applications.Proceedings of the National Academy of Sciences, 115(12):2952–2957, 2018
2018
-
[49]
Price and P
S. Price and P. A. Flach. Computational support for academic peer review: A perspective from artificial intelligence.Communications of the ACM, 60(3):70–79, 2017
2017
- [50]
-
[51]
A. Saad, N. Jenko, S. Ariyaratne, N. Birch, K. P. Iyengar, A. M. Davies, R. Vaishya, and R. Botchu. Exploring the potential of chatgpt in the peer review process: an observational study.Diabetes & Metabolic Syndrome: Clinical Research & Reviews, 18(2):102946, 2024
2024
-
[52]
Stahl, L
M. Stahl, L. Biermann, A. Nehring, and H. Wachsmuth. Exploring llm prompting strategies for joint essay scoring and feedback generation. InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), 2024
2024
-
[53]
Stappen, G
L. Stappen, G. Rizos, M. Hasan, T. Hain, and B. W. Schuller. Uncertainty-aware machine support for paper reviewing on the interspeech 2019 submission corpus. In21st Annual Conference of the International Speech Communication Association, 2020
2019
-
[54]
P. Sukpanichnant, A. Rapberger, and F. Toni. Peerarg: Argumentative peer review with llms.arXiv preprint arXiv:2409.16813, 2024
-
[55]
P. Taechoyotin and D. Acuna. Remor: Automated peer review generation with llm reasoning and multi-objective reinforcement learning.arXiv preprint arXiv:2505.11718, 2025
- [56]
-
[57]
G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [59]
-
[60]
Wang and Y
Q. Wang and Y. Tan. Grammatical error detection with self attention by pairwise training. In2020 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2020
2020
-
[61]
Q. Wang, Q. Zeng, L. Huang, K. Knight, H. Ji, and N. F. Rajani. ReviewRobot: Explainable paper review generation based on knowledge synthesis. InProceed- ings of the 13th International Conference on Natural Language Generation, pages 384–397, 2020
2020
-
[62]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022
2022
-
[63]
Y. Weng, M. Zhu, G. Bao, H. Zhang, J. Wang, Y. Zhang, and L. Yang. Cyclere- searcher: Improving automated research via automated review. 2025
2025
-
[64]
C. J. Willmott and K. Matsuura. Advantages of the mean absolute error (mae) over the root mean square error (rmse) in assessing average model performance. Climate research, 30(1):79–82, 2005
2005
- [65]
- [66]
-
[67]
J. Yu, Z. Ding, J. Tan, K. Luo, Z. Weng, C. Gong, L. Zeng, R. Cui, C. Han, Q. Sun, Z. Wu, Y. Lan, and X. Li. Automated peer reviewing in paper SEA: Standardiza- tion, evaluation, and analysis. InFindings of the Association for Computational Linguistics: EMNLP 2024, 2024
2024
- [68]
-
[69]
Yuan and P
W. Yuan and P. Liu. Kid-review: knowledge-guided scientific review generation with oracle pre-training. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11639–11647, 2022
2022
-
[70]
W. Yuan, P. Liu, and G. Neubig. Can we automate scientific reviewing?Journal of Artificial Intelligence Research, 75:171–212, 2022
2022
- [71]
-
[72]
BERTScore: Evaluating Text Generation with BERT
T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. Bertscore: Evaluating text generation with bert.arXiv preprint arXiv:1904.09675, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[73]
W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger. Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 563–578, 2019
2019
-
[74]
Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh. Calibrate before use: Improving few-shot performance of language models. InInternational conference on machine learning, pages 12697–12706. PMLR, 2021
2021
-
[75]
R. Zhou, L. Chen, and K. Yu. Is llm a reliable reviewer? a comprehensive evaluation of llm on automatic paper reviewing tasks. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9340–9351, 2024
2024
- [76]
- [77]
- [78]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.