Recognition: unknown
AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
Pith reviewed 2026-05-10 12:55 UTC · model grok-4.3
The pith
State-of-the-art AI methods can generate peer reviews at conference scale that authors and committee members prefer to human reviews on dimensions such as technical accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A multi-stage system combining frontier models, tool use, and safeguards generated AI reviews for every main-track submission at the conference. Surveys indicated that authors and program committee members not only found the AI reviews useful but preferred them to human reviews on key dimensions such as technical accuracy and research suggestions. A novel benchmark demonstrated that the system substantially outperforms a simple LLM-generated review baseline at detecting various scientific weaknesses.
What carries the argument
The multi-stage AI review generation system that uses frontier models with tool use and safeguards to create reviews for all submissions.
Load-bearing premise
Survey responses from authors and program committee members reflect genuine review quality rather than being skewed by novelty effects or other unmeasured biases.
What would settle it
A follow-up experiment in which independent experts, blind to the source, rate paired AI and human reviews on the same set of papers for technical soundness, completeness, and helpfulness, with results showing no advantage for AI.
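To make that concrete, the sketch below runs the analysis such a follow-up would call for: a paired, source-blind comparison of expert ratings using a Wilcoxon signed-rank test. The ratings here are randomly generated placeholders and the rating scale is assumed; nothing in the block comes from the paper itself.

```python
# Hypothetical analysis for a blinded, paired AI-vs-human review comparison.
# Each paper contributes one AI review and one human review, both rated 1-5
# for technical soundness by the same source-blind expert.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Placeholder ratings; a real study would collect these from blinded raters.
ai_ratings = rng.integers(1, 6, size=200)
human_ratings = rng.integers(1, 6, size=200)

# Paired two-sided test: is there a systematic rating difference at all?
diff = ai_ratings - human_ratings
stat, p_value = wilcoxon(diff)

print(f"median difference (AI - human): {np.median(diff):+.1f}")
print(f"Wilcoxon signed-rank p-value:   {p_value:.3f}")
```

A null result here, with no detectable advantage for either source, is the outcome the editorial frames as settling the question against the paper's claim.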
Original abstract
Scientific peer review faces mounting strain as submission volumes surge, making it increasingly difficult to sustain review quality, consistency, and timeliness. Recent advances in AI have led the community to consider its use in peer review, yet a key unresolved question is whether AI can generate technically sound reviews at real-world conference scale. Here we report the first large-scale field deployment of AI-assisted peer review: every main-track submission at AAAI-26 received one clearly identified AI review from a state-of-the-art system. The system combined frontier models, tool use, and safeguards in a multi-stage process to generate reviews for all 22,977 full-review papers in less than a day. A large-scale survey of AAAI-26 authors and program committee members showed that participants not only found AI reviews useful, but actually preferred them to human reviews on key dimensions such as technical accuracy and research suggestions. We also introduce a novel benchmark and find that our system substantially outperforms a simple LLM-generated review baseline at detecting a variety of scientific weaknesses. Together, these results show that state-of-the-art AI methods can already make meaningful contributions to scientific peer review at conference scale, opening a path toward the next generation of synergistic human-AI teaming for evaluating research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports the first large-scale deployment of an AI peer-review system at AAAI-26, generating clearly labeled AI reviews for all 22,977 main-track submissions in under a day using frontier models, tool use, and safeguards. It presents survey results from authors and PC members indicating preference for AI reviews over human ones on technical accuracy and research suggestions, introduces a novel benchmark where the system outperforms a simple LLM-generated review baseline at detecting scientific weaknesses, and concludes that state-of-the-art AI can already make meaningful contributions to peer review at conference scale.
Significance. If the empirical results hold, this is a significant contribution as the first reported real-world, conference-scale field test of AI-assisted review. The deployment scale (nearly 23k papers) and dual evidence from survey plus benchmark provide concrete data on feasibility. Credit is due for the practical engineering of the multi-stage pipeline and for releasing a new benchmark for review quality assessment.
major comments (3)
- [Survey results section] Response rates, the sampling frame, and any statistical tests comparing AI and human reviews on accuracy and suggestions are not reported. This is load-bearing for the central claim: the reported preference cannot be interpreted without these details, and potential self-selection or novelty bias goes unmeasured (an illustrative reporting sketch follows this list).
- [Benchmark section] The set of scientific weaknesses is constructed internally, without external validation against documented real-world review failures or a direct comparison to human reviewer detection rates. This undermines the claim that outperforming the simple LLM baseline translates into practical review utility.
- [Abstract and evaluation sections] The survey explicitly labels AI reviews as such, yet no controls or measurements for social-desirability or positivity bias are described, leaving open whether the stated preferences reflect objective quality or labeling effects.
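An illustrative example of the reporting the first comment asks for, using only placeholder counts rather than figures from the AAAI-26 survey: an exact binomial test of the head-to-head preference share against a no-preference null, with a confidence interval.

```python
# Illustrative survey reporting for a head-to-head preference question.
# The counts are hypothetical placeholders, not AAAI-26 survey results.
from scipy.stats import binomtest

prefer_ai, prefer_human = 620, 480      # respondents preferring each source
n = prefer_ai + prefer_human

result = binomtest(prefer_ai, n, p=0.5)           # null: 50/50, no preference
ci = result.proportion_ci(confidence_level=0.95)  # Clopper-Pearson interval

print(f"AI review preferred by {prefer_ai / n:.1%} of {n} respondents")
print(f"95% CI [{ci.low:.1%}, {ci.high:.1%}], two-sided p = {result.pvalue:.2g}")
```

Reporting the denominator and response rate alongside the preference share would also make any self-selection visible, which is the referee's underlying concern.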
minor comments (2)
- [Benchmark section] Clarify the exact composition of the 'simple LLM-generated review baseline' (prompting details, model version) to allow replication.
- [System description] The multi-stage pipeline description would benefit from a diagram or pseudocode illustrating the safeguards and tool-use steps (a hypothetical sketch follows these comments).
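In the spirit of that last suggestion, here is a minimal pseudocode sketch of a generic multi-stage review pipeline with a safeguard pass. The stage boundaries, prompts, and checks are assumptions about how such systems are commonly structured, not the paper's actual implementation.

```python
# Hypothetical multi-stage review pipeline with tool use and a safeguard pass.
# Stage names, prompts, and checks are illustrative; the paper's design may differ.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReviewDraft:
    summary: str
    comments: list[str]
    flagged: bool            # True if a safeguard check forced a revision

def run_pipeline(
    paper_text: str,
    llm: Callable[[str], str],          # prompt -> completion
    tools: list[Callable[[str], str]],  # e.g. citation lookup, math/code checks
    safeguard: Callable[[str], bool],   # True if the draft needs another pass
) -> ReviewDraft:
    # Stage 1: structured reading of the submission.
    outline = llm("Summarize the claims, methods, and evidence:\n" + paper_text)

    # Stage 2: tool-assisted evidence gathering over the outline.
    evidence = [tool(outline) for tool in tools]

    # Stage 3: draft the review, grounded in the outline and gathered evidence.
    draft = llm("Write a referee report.\nOutline: " + outline +
                "\nEvidence: " + "; ".join(evidence))

    # Stage 4: safeguard pass; revise once if the check trips.
    flagged = safeguard(draft)
    if flagged:
        draft = llm("Remove unsupported or policy-violating claims:\n" + draft)

    return ReviewDraft(summary=outline, comments=draft.splitlines(), flagged=flagged)

# Stub usage with trivial callables, just to show the control flow.
demo = run_pipeline(
    "Example paper text.",
    llm=lambda prompt: "stub completion",
    tools=[lambda outline: "stub evidence"],
    safeguard=lambda draft: False,
)
print(demo.flagged)
```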
Simulated Author's Rebuttal
We thank the referee for their constructive comments and for acknowledging the significance of this large-scale deployment. We provide point-by-point responses to the major comments below and indicate the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [Survey results section] Response rates, the sampling frame, and any statistical tests comparing AI and human reviews on accuracy and suggestions are not reported. This is load-bearing for the central claim, as the reported preference cannot be interpreted without these details (potential self-selection or novelty bias unmeasured).
Authors: We agree that these details are necessary to fully interpret the survey results. In the revised manuscript, we will add the response rates and sampling-frame details (all authors and PC members were invited to the survey). We will also report the statistical tests used to compare preferences between AI and human reviews on the dimensions of accuracy and suggestions. Furthermore, we will expand the limitations section to discuss potential self-selection and novelty biases. Revision: yes.
- Referee: [Benchmark section] The set of scientific weaknesses is constructed internally, without external validation against documented real-world review failures or a direct comparison to human reviewer detection rates. This undermines the claim of superiority over the simple LLM baseline for practical review utility.
Authors: The benchmark provides a standardized way to evaluate the AI system's ability to detect predefined scientific weaknesses, and our claim is specifically that the system outperforms the simple LLM baseline on this benchmark (a schematic of such scoring appears after these responses). We will revise the section to give more detail on how the weakness categories were constructed, drawing on common issues in peer review. We will also add an explicit discussion of the limitations, including the internal construction and the lack of a direct comparison to human reviewer performance, as we do not have such paired data available. Revision: partial.
- Referee: [Abstract and evaluation sections] The survey explicitly labels AI reviews as such, yet no controls or measurements for social-desirability or positivity bias are described, leaving open whether preferences reflect objective quality or labeling effects.
Authors: We acknowledge the potential for labeling effects in the survey design. The revised manuscript will include additional text in the evaluation section discussing this possible bias and its implications for interpreting the preference results. We note that, while no blinded control was implemented, the survey was conducted after the reviews were provided, and preferences were consistent across different respondent groups. Revision: yes.
- Not addressed in this revision: a direct comparison to human reviewer detection rates on the benchmark, as this would require a separate study in which human reviewers evaluate the same set of papers for the defined weaknesses.
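Following up on the benchmark exchange above, here is a schematic of how detection-rate scoring against such a benchmark might work, assuming each paper carries a known set of annotated weaknesses. Every identifier and count below is a placeholder, not data from the paper.

```python
# Illustrative scoring for a weakness-detection benchmark: each paper has a
# known set of annotated weaknesses, and each system's review is checked for
# which of them it surfaces. All identifiers and counts are placeholders.
def detection_rate(annotated: dict[str, set[str]],
                   detected: dict[str, set[str]]) -> float:
    """Fraction of annotated weaknesses that a system's reviews mention."""
    hits = sum(len(annotated[p] & detected.get(p, set())) for p in annotated)
    total = sum(len(w) for w in annotated.values())
    return hits / total if total else 0.0

annotated = {"paper_1": {"missing_baseline", "data_leakage"},
             "paper_2": {"unsupported_claim"}}

multi_stage_system = {"paper_1": {"missing_baseline", "data_leakage"},
                      "paper_2": {"unsupported_claim"}}
simple_llm_baseline = {"paper_1": {"missing_baseline"},
                       "paper_2": set()}

print(f"multi-stage system:  {detection_rate(annotated, multi_stage_system):.0%}")
print(f"simple LLM baseline: {detection_rate(annotated, simple_llm_baseline):.0%}")
```

Running the same scoring over human reviews on a matched subset of papers would address the missing human comparison directly.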
Circularity Check
No significant circularity: empirical deployment report grounded in external data collection
full rationale
The paper presents results from a real-world deployment of AI-generated reviews for all AAAI-26 submissions, followed by surveys of authors and PC members plus a new benchmark for detecting scientific weaknesses. No mathematical derivation chain, equations, fitted parameters, or self-referential definitions exist. Central claims rest on collected survey responses and benchmark performance against an external baseline, with no load-bearing steps that reduce by construction to the paper's own inputs or prior self-citations. This is a standard empirical field study whose validity hinges on data quality rather than definitional equivalence.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: survey responses from authors and program committee members provide an unbiased measure of review usefulness and technical accuracy.