arxiv: 2604.19578 · v1 · submitted 2026-04-21 · 💻 cs.CL · cs.AI· cs.DL· cs.IR

Recognition: unknown

Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI

Wenqing Wu , Chengzhi Zhang , Yi Zhao , Tong Bao

Authors on Pith no claims yet

Pith reviewed 2026-05-10 01:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.DLcs.IR

keywords peer reviewlarge language modelsLLMAI conferencesevaluation aspectslinguistic analysisreview reportsacademic publishing

0 comments

The pith

LLMs have made peer review texts in AI conferences longer and more focused on summaries, while reducing attention to originality and critical reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies changes in peer review reports from major AI conferences since large language models became available. It finds that these reports grew longer and more fluent, with reviewers putting more weight on summaries and surface-level clarity. Meanwhile, comments on deeper issues like originality, replicability, and detailed criticism became less common, especially from reviewers who rate their own confidence lower. This shift is important because peer review aims to improve manuscript quality through careful evaluation, and if the balance moves toward easier surface checks, it could change what research gets published. The analysis relies on measuring word and sentence features plus automatic tagging of what each sentence evaluates in the review.

Core claim

Following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly reviewers with lower confidence score. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.

What carries the argument

Fine-grained automatic annotation of evaluation aspects in individual review sentences, paired with maximum likelihood estimation to identify likely LLM-assisted reports for comparison across time periods.

If this is right

Reviews place greater emphasis on summaries and surface-level clarity.
Attention to originality, replicability, and nuanced critical reasoning has decreased.
Linguistic patterns have become more standardized, especially among lower-confidence reviewers.
The informativeness of recommendation signals for paper decisions may be impacted by these focus shifts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pattern holds, AI conferences may need to revise review guidelines to maintain depth in evaluations.
Over time, this could affect the overall rigor and direction of published AI research.
Researchers could examine whether papers accepted under more LLM-influenced reviews show different long-term impact.
Tools to assist reviewers might be designed to prompt for deeper critical analysis rather than just fluency.

Load-bearing premise

The changes in review length, fluency, and evaluative focus result from LLM adoption rather than other concurrent changes in the field or review process.

What would settle it

A study that compares reviews from the same reviewers before and after LLM availability, or directly measures the proportion of LLM-generated content in reviews and correlates it with the observed changes.

read the original abstract

With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is improving the quality of academic manuscripts, such as clarity, originality and other evaluation aspects. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a maximum likelihood estimation method, previously established, to identify review reports that potentially have modified or generated by LLMs. Finally, we assess the impact of evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendation for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly reviewers with lower confidence score. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper supplies fresh measurements of how AI conference reviews have shifted since LLMs arrived, but the evidence tying those shifts to LLM use is observational and lacks controls.

read the letter

This paper tracks peer reviews from top AI conferences and reports that after LLMs appeared, the texts got longer and more fluent, with more emphasis on summaries and surface clarity and less on originality, replicability, or nuanced critique. Reviews from lower-confidence reviewers show more standardized patterns. They use an existing MLE method to flag likely LLM-assisted reports and check how those affect recommendation informativeness.

Referee Report

3 major / 3 minor

Summary. The manuscript examines changes in peer-review reports from top AI conferences following the emergence of LLMs. It analyzes linguistic features (length, word/sentence complexity, fluency), applies automatic sentence-level annotation of evaluation aspects (e.g., summary, clarity, originality, replicability), uses a previously established MLE method to flag potentially LLM-assisted reviews, and assesses how aspect distributions in those reviews relate to recommendation informativeness. The central claim is that post-LLM reviews are longer and more fluent, emphasize summaries and surface-level clarity with more standardized patterns (especially among low-confidence reviewers), while showing reduced attention to deeper dimensions such as originality, replicability, and nuanced critical reasoning.

Significance. If the causal attribution and measurement validity hold, the work would offer a timely fine-grained view of how LLMs may be reshaping evaluative practices in AI peer review, with potential implications for review quality and scientific standards. The sentence-level aspect analysis and integration of linguistic metrics with LLM detection constitute a methodological strength over coarser aggregate studies. The paper would benefit from explicit validation steps to elevate its contribution.

major comments (3)

[§3.3] §3.3 (LLM detection): The MLE-based identification of LLM-assisted reviews is applied without reported validation, accuracy metrics, or robustness checks on peer-review text; because this classification partitions the data for all subsequent contrasts, any systematic bias in detection directly undermines the attribution of shifts to LLM use.
[§4] §4 (aspect annotation): The automatic sentence-level aspect classifier is deployed without human validation, inter-annotator agreement, or error analysis on the peer-review corpus; the reported decline in originality/replicability focus and rise in summary emphasis therefore cannot be distinguished from possible annotation artifacts.
[§5] §5 (results): The before/after design reports directional changes in length, fluency, and aspect distributions but includes no statistical controls or matching for confounders such as reviewer demographics, conference guideline updates, or submission-volume growth; these omissions leave the causal claim load-bearing yet untested.

minor comments (3)

[Abstract] Abstract and §2: The phrase 'reviewers with lower confidence score' is used without defining how confidence is measured or whether it is self-reported by reviewers.
[§3.1] §3.1: The exact formulas or libraries for 'complexity of words and sentences' (e.g., specific readability indices or syntactic metrics) should be stated explicitly rather than left at the level of 'linguistic features'.
[Table 1] Table 1 or equivalent: Sample sizes, period boundaries, and number of reviews per conference should be reported with exact counts to allow assessment of statistical power.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with our responses and indicate the revisions we will incorporate.

read point-by-point responses

Referee: [§3.3] §3.3 (LLM detection): The MLE-based identification of LLM-assisted reviews is applied without reported validation, accuracy metrics, or robustness checks on peer-review text; because this classification partitions the data for all subsequent contrasts, any systematic bias in detection directly undermines the attribution of shifts to LLM use.

Authors: The MLE detection method follows the previously established and validated approach from prior work on LLM-generated text detection. We will revise §3.3 to explicitly report the original method's validation metrics (e.g., accuracy on benchmark datasets) and add robustness checks tailored to peer-review text, including sensitivity analysis across thresholds and a small human-annotated sample to evaluate performance in this domain. This addresses potential bias concerns while preserving the partitioning for subsequent analyses. revision: yes
Referee: [§4] §4 (aspect annotation): The automatic sentence-level aspect classifier is deployed without human validation, inter-annotator agreement, or error analysis on the peer-review corpus; the reported decline in originality/replicability focus and rise in summary emphasis therefore cannot be distinguished from possible annotation artifacts.

Authors: We recognize the need for validation of the automatic aspect classifier. We will add a dedicated validation subsection (or appendix) that includes human annotation on a representative sample of review sentences, reporting inter-annotator agreement metrics such as Cohen's kappa and a detailed error analysis. This will help confirm that shifts in aspect distributions reflect genuine changes rather than classifier artifacts. revision: yes
Referee: [§5] §5 (results): The before/after design reports directional changes in length, fluency, and aspect distributions but includes no statistical controls or matching for confounders such as reviewer demographics, conference guideline updates, or submission-volume growth; these omissions leave the causal claim load-bearing yet untested.

Authors: The study employs an observational before/after design to document temporal shifts post-LLM emergence. We will revise §5 and the discussion to incorporate additional controls for available factors (e.g., conference and year effects) and explicitly discuss potential confounders such as guideline updates and submission growth. We will also temper causal language to emphasize correlational findings and acknowledge limitations where full matching (e.g., for reviewer demographics) is not feasible due to data constraints. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper performs an observational before-after analysis of peer-review texts using standard linguistic metrics (length, fluency, word/sentence complexity) and automatic sentence-level aspect annotation. It additionally applies a previously established MLE method to flag LLM-assisted reports and then compares aspect distributions and recommendation informativeness. No equations, fitted parameters, or derived quantities are defined in terms of the target results themselves; the before-after shifts and LLM-assisted contrasts are computed directly from the data splits and external tools rather than reducing tautologically to inputs. The MLE reference is invoked as an established detection tool rather than a self-derived uniqueness theorem or load-bearing premise that forces the headline findings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the analysis rests on standard NLP feature extraction and a previously published MLE procedure whose assumptions are not restated here.

pith-pipeline@v0.9.0 · 5585 in / 1247 out tokens · 49462 ms · 2026-05-10T01:45:49.521032+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 39 canonical work pages · 1 internal anchor

[1]

Proceedings of the National Academy of Sciences 112(2), 360–365 (2015) https://doi.org/10.1073/pnas.1418218112 40

Siler, K., Lee, K., Bero, L.: Measuring the eﬀectiveness of scientiﬁ c gatekeep- ing. Proceedings of the National Academy of Sciences 112(2), 360–365 (2015) https://doi.org/10.1073/pnas.1418218112 40

work page doi:10.1073/pnas.1418218112 2015
[2]

Scientometrics 113(1), 633–650 (2017) https://doi.org/10.1007/s11192-017-2310-5

Huisman, J., Smits, J.: Duration and quality of the peer review pro- cess: the author’s perspective. Scientometrics 113(1), 633–650 (2017) https://doi.org/10.1007/s11192-017-2310-5

work page doi:10.1007/s11192-017-2310-5 2017
[3]

Scientometrics 130(3), 1547–1569 (2025) https://doi.org/10.1007/s11192-025-05264-8

Yu, H., Liang, Y., Xie, Y.: Understanding the sustainability of supply–demand in peer review system: an analysis based on scholars ’ research and review activities. Scientometrics 130(3), 1547–1569 (2025) https://doi.org/10.1007/s11192-025-05264-8

work page doi:10.1007/s11192-025-05264-8 2025
[4]

FEMS Microbiolo gy Letters 365(19), 204 (2018) https://doi.org/10.1093/femsle/fny204

Tennant, J.P.: The state of the art in peer review. FEMS Microbiolo gy Letters 365(19), 204 (2018) https://doi.org/10.1093/femsle/fny204

work page doi:10.1093/femsle/fny204 2018
[5]

https://arxiv.org/abs/2106.00810

Russo, A.: Some Ethical Issues in the Review Process of Machine L earning Conferences (2021). https://arxiv.org/abs/2106.00810

work page arXiv 2021
[6]

https://openreview.net/forum?id=Cn706AbJaKW

Tran, D., Valtchanov, A.V., Ganapathy, K.R., Feng, R., Slud, E.V., Go ld- blum, M., Goldstein, T.: An Open Review of OpenReview: A Criti- cal Analysis of the Machine Learning Conference Review Process (2 021). https://openreview.net/forum?id=Cn706AbJaKW
[7]

Stelmakh, I., Shah, N.B., Singh, A., Daum´ e, H.: Prior and prejudice : The novice reviewers’ bias against resubmissions in conference peer review. Proc. ACM Hum.- Comput. Interact. 5(CSCW1) (2021) https://doi.org/10.1145/3449149

work page doi:10.1145/3449149 2021
[8]

https://arxiv.org/abs/2211.06398

Zhang, J., Zhang, H., Deng, Z., Roth, D.: Investigating Fairness D is- parities in Peer Review: A Language Model Enhanced Approach (202 2). https://arxiv.org/abs/2211.06398

work page arXiv
[9]

Functional Ecology 37(5), 1144–1157 (2023) https://doi.org/10.1111/1365-2435.14259

Fox, C.W., Meyer, J., Aim´ e, E.: Double-blind peer review aﬀects rev iewer ratings and editor decisions at an ecology journal. Functional Ecology 37(5), 1144–1157 (2023) https://doi.org/10.1111/1365-2435.14259

work page doi:10.1111/1365-2435.14259 2023
[10]

In: Proceedings of the 37th International Conference on Neura l Information Pro- cessing Systems

Lu, Y., Kong, Y.: Calibrating ”cheap signals” in peer review without a prior. In: Proceedings of the 37th International Conference on Neura l Information Pro- cessing Systems. NIPS ’23. Curran Associates Inc., Red Hook, NY, USA (2024). https://doi.org/10.5555/3666122.3667028

work page doi:10.5555/3666122.3667028 2024
[11]

In: Proceedings of the 37th Int ernational Con- ference on Neural Information Processing Systems

Xu, Y.E., Jecmen, S., Song, Z., Fang, F.: A one-size-ﬁts-all appr oach to improving randomness in paper assignment. In: Proceedings of the 37th Int ernational Con- ference on Neural Information Processing Systems. NIPS ’23. Cu rran Associates Inc., Red Hook, NY, USA (2024)

2024
[12]

https://arxiv.org/abs/2310.05966

Liu, Y., Yang, K., Liu, Y., Drew, M.G.B.: The Shackles of Peer Review: Unveiling the Flaws in the Ivory Tower (2023). https://arxiv.org/abs/2310.05966

work page arXiv 2023
[14]

Yuan, W., Liu, P., Neubig, G.: Can we automate scientiﬁc reviewing? J. Artif. Int. Res. 75 (2022) https://doi.org/10.1613/jair.1.12862

work page doi:10.1613/jair.1.12862 2022
[15]

Proceedings of the AAAI Conference on Artiﬁ cial Intelligence 36(10), 11639–11647 (2022) https://doi.org/10.1609/aaai.v36i10.21418

Yuan, W., Liu, P.: Kid-review: Knowledge-guided scientiﬁc review g eneration with oracle pre-training. Proceedings of the AAAI Conference on Artiﬁ cial Intelligence 36(10), 11639–11647 (2022) https://doi.org/10.1609/aaai.v36i10.21418

work page doi:10.1609/aaai.v36i10.21418 2022
[19]

In: Al-Onaizan, Y., B ansal, M., Chen, Y.-N

Jin, Y., Zhao, Q., Wang, Y., Chen, H., Zhu, K., Xiao, Y., Wang, J.: Age ntReview: Exploring peer review dynamics with LLM agents. In: Al-Onaizan, Y., B ansal, M., Chen, Y.-N. (eds.) Proceedings of the 2024 Conference on Empir ical Methods in Natural Language Processing, pp. 1208–1226. Association for Computational Linguistics, Miami, Florida, USA (2024). ...

2024
[20]

BMJ 367 (2019) https://doi.org/10.1136/bmj.l6460

Barnett, A., Mewburn, I., Schroter, S.: Working 9 to 5, not the way to make an academic living: observational analysis of manuscript and peer revie w submissions over time. BMJ 367 (2019) https://doi.org/10.1136/bmj.l6460

work page doi:10.1136/bmj.l6460 2019
[21]

Geng, Y., Cao, R., Han, X., Tian, W., Zhang, G., Wang, X.: Scientists are working overtime: when do scientists download scientiﬁc papers? Scientome trics 127(11), 6413–6429 (2022) https://doi.org/10.1007/s11192-022-04524-1

work page doi:10.1007/s11192-022-04524-1 2022
[22]

GPT-4 Technical Report

OpenAI: GPT-4 Technical Report (2024). https://arxiv.org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

The Lancet Infectious Diseases 23(7), 781 (2023) https://doi.org/10.1016/S1473-3099(23)00290-6

Donker, T.: The dangers of using large language models for peer review. The Lancet Infectious Diseases 23(7), 781 (2023) https://doi.org/10.1016/S1473-3099(23)00290-6

work page doi:10.1016/s1473-3099(23)00290-6 2023
[24]

42 Nature 628(8008), 483–484 (2024) https://doi.org/10.1038/d41586-024-01051-2

Chawla, D.S.: Is chatgpt corrupting peer review? telltale words h int at ai use. 42 Nature 628(8008), 483–484 (2024) https://doi.org/10.1038/d41586-024-01051-2

work page doi:10.1038/d41586-024-01051-2 2024
[25]

McFarland, and James Zou

Liang, W., Zhang, Y., Cao, H., Wang, B., Ding, D.Y., Yang, X., Vodrah alli, K., He, S., Smith, D.S., Yin, Y., McFarland, D.A., Zou, J.: Can large language m od- els provide useful feedback on research papers? a large-scale em pirical analysis. NEJM AI 1(8), 2400196 (2024) https://doi.org/10.1056/AIoa2400196

work page doi:10.1056/aioa2400196 2024
[26]

In: Salakhutdinov, R., Kolter, Z., Heller, K., W eller, A., Oliver, N., Scarlett, J., Berkenkamp, F

Liang, W., Izzo, Z., Zhang, Y., Lepp, H., Cao, H., Zhao, X., Chen, L ., Ye, H., Liu, S., Huang, Z., McFarland, D., Zou, J.Y.: Monitoring AI-modiﬁed content at scale: A case study on the impact of ChatGPT on AI con- ference peer reviews. In: Salakhutdinov, R., Kolter, Z., Heller, K., W eller, A., Oliver, N., Scarlett, J., Berkenkamp, F. (eds.) Proceedings ...

2024
[27]

arXiv preprint arXiv:2405.02150

Latona, G.R., Ribeiro, M.H., Davidson, T.R., Veselovsky, V., West, R .: The AI Review Lottery: Widespread AI-Assisted Peer Reviews Boost Pape r Scores and Acceptance Rates (2024). https://arxiv.org/abs/2405.02150

work page arXiv 2024
[29]

A Dataset of Peer Reviews ( P eer R ead): Collection, Insights and NLP Applications

Kang, D., Ammar, W., Dalvi, B., Zuylen, M., Kohlmeier, S., Hovy, E., Sc hwartz, R.: A dataset of peer reviews (PeerRead): Collection, insights and N LP applica- tions. In: Proceedings of the 2018 Conference of the North Amer ican Chapter of the Association for Computational Linguistics: Human Language Te chnologies, Volume 1 (Long Papers), pp. 1647–1661. ...

work page doi:10.18653/v1/n18-1149 2018
[30]

In: Davis, B., Graham, Y., Kelleher, J., Sripada, Y

Wang, Q., Zeng, Q., Huang, L., Knight, K., Ji, H., Rajani, N.F.: Review Robot: Explainable paper review generation based on knowledge synthesis. In: Davis, B., Graham, Y., Kelleher, J., Sripada, Y. (eds.) Proceedings of the 13th International Conference on Natural Language Gener ation, pp. 384–397. Association for Computational Linguistics, Dublin, Irelan...

work page doi:10.18653/v1/2020.inlg-1.44 2020
[31]

In: Chandrasekaran, M.K., Waard, A., Feigenblat, G., Freitag, D., Ghosal, T., Hovy, E., Knoth, P., Konopnicki, D., Mayr, P., Patton, R.M., Shmueli-Scheuer, M

Li, J., Sato, A., Shimura, K., Fukumoto, F.: Multi-task peer-revie w score predic- tion. In: Chandrasekaran, M.K., Waard, A., Feigenblat, G., Freitag, D., Ghosal, T., Hovy, E., Knoth, P., Konopnicki, D., Mayr, P., Patton, R.M., Shmueli-Scheuer, M. (eds.) Proceedings of the First Workshop on Scholarly Document Process- ing, pp. 121–126. Association for Com...

work page doi:10.18653/v1/2020.sdp-1.14 2020
[32]

43 https://arxiv.org/abs/2502.19614

Yu, S., Luo, M., Madusu, A., Lal, V., Howard, P.: Is Your Paper Bein g Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review ( 2025). 43 https://arxiv.org/abs/2502.19614

work page arXiv 2025
[33]

https://arxiv.org/abs/2502.11736

Garg, M.K., Prasad, T., Singhal, T., Kirtani, C., Mandal, M., Kumar, D .: ReviewEval: An Evaluation Framework for AI-Generated Reviews (2 025). https://arxiv.org/abs/2502.11736

work page arXiv
[34]

Information Fusion 124, 103332 (2025) https://doi.org/10.1016/j.inﬀus.2025.103332

Zhuang, Z., Chen, J., Xu, H., Jiang, Y., Lin, J.: Large language mod els for auto- mated scholarly paper review: A survey. Information Fusion 124, 103332 (2025) https://doi.org/10.1016/j.inﬀus.2025.103332

work page doi:10.1016/j.in 2025
[35]

https://arxiv.org/abs/2307.05492

Robertson, Z.: GPT4 is Slightly Helpful for Peer-Review Assistan ce: A Pilot Study (2023). https://arxiv.org/abs/2307.05492

work page arXiv 2023
[36]

https://arxiv.org/abs/2306.00622

Liu, R., Shah, N.B.: ReviewerGPT? An Exploratory Study on Using L arge Language Models for Paper Reviewing (2023). https://arxiv.org/abs/2306.00622

work page arXiv 2023
[37]

Thelwall, M.: Can chatgpt evaluate research quality? Journal of Data and Information Science 9(2), 1–21 (2024) https://doi.org/10.2478/jdis-2024-0013

work page doi:10.2478/jdis-2024-0013 2024
[38]

In: Calzolari, N., Kan , M.-Y., Hoste, V., Lenci, A., Sakti, S., Xue, N

Zhou, R., Chen, L., Yu, K.: Is LLM a reliable reviewer? a comprehen sive evalua- tion of LLM on automatic paper reviewing tasks. In: Calzolari, N., Kan , M.-Y., Hoste, V., Lenci, A., Sakti, S., Xue, N. (eds.) Proceedings of the 202 4 Joint International Conference on Computational Linguistics, Langua ge Resources and Evaluation (LREC-COLING 2024), pp. 934...

2024
[39]

In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N

Du, J., Wang, Y., Zhao, W., Deng, Z., Liu, S., Lou, R., Zou, H.P., Narayanan Venkit, P., Zhang, N., Srinath, M., Zhang, H.R., Gupta, V., Li, Y., Li, T., Wang, F., Liu, Q., Liu, T., Gao, P., Xia, C., Xing, C., Jiayang, C., Wang, Z., Su, Y., Shah, R.S., Guo, R., Gu, J., Li, H., Wei, K., Wang, Z., Chen g, L., Ranathunga, S., Fang, M., Fu, J., Liu, F., Huang,...

2024
[40]

Alireza Ghafarollahi and Markus J Buehler

Gao, Z., Brantley, K., Joachims, T.: Reviewer2: Optimizing Review G eneration Through Prompt Generation (2024). https://arxiv.org/abs/2402.10886

work page arXiv 2024
[41]

Qwen Team

Tan, C., Lyu, D., Li, S., Gao, Z., Wei, J., Ma, S., Liu, Z., Li, S.Z.: Peer Re view as A Multi-Turn and Long-Context Dialogue with Role-Based Interactio ns (2024). https://arxiv.org/abs/2406.05688

work page arXiv 2024
[42]

Marg: Multi-agent review generation for scientific papers.ArXiv, abs/2401.04259,

D’Arcy, M., Hope, T., Birnbaum, L., Downey, D.: MARG: Multi-Agent Review Generation for Scientiﬁc Papers (2024). https://arxiv.org/abs/2401.04259 44

work page arXiv 2024
[43]

In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N

Yu, J., Ding, Z., Tan, J., Luo, K., Weng, Z., Gong, C., Zeng, L., Cui, R ., Han, C., Sun, Q., Wu, Z., Lan, Y., Li, X.: Automated peer reviewing in paper SEA: Stan- dardization, evaluation, and analysis. In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N. (eds.) Findings of the Association for Computational Linguistics: EM NLP 2024, pp. 10164–10184. Association fo...

2024
[44]

I n: Editors: Shushanik Sargsyan, Wolfgang Gl¨ anzel, Giovanni Abramo: 20th In ternational Conference On Scientometrics & Informetrics, pp

Wu, W., Zhang, Y., Haunschild, R., Bornmann, L.: Leveraging large language models for post-publication peer review: Potential and limitations. I n: Editors: Shushanik Sargsyan, Wolfgang Gl¨ anzel, Giovanni Abramo: 20th In ternational Conference On Scientometrics & Informetrics, pp. 23–27 (2025)

2025
[45]

https://openreview.net/forum?id=0CfKpG94co

GENG, M., Trotta, R.: Is chatGPT transforming academics’ writ ing style? In: ICML 2024 Next Generation of AI Safety Workshop (202 4). https://openreview.net/forum?id=0CfKpG94co

2024
[46]

In: Araujo, P.H., Baumann, A., Gromann, D., Krenn, B., Roth, B., Wiegand, M

Evans, J., D’Souza, J., Auer, S.: Large language models as evalua tors for scien- tiﬁc synthesis. In: Araujo, P.H., Baumann, A., Gromann, D., Krenn, B., Roth, B., Wiegand, M. (eds.) Proceedings of the 20th Conference on Natura l Language Pro- cessing (KONVENS 2024), pp. 1–22. Association for Computationa l Linguistics, Vienna, Austria (2024). https://acla...

2024
[47]

Human – Diﬀerentiation Analysis of Scientiﬁc Content Generation (20 23)

Ma, Y., Liu, J., Yi, F., Cheng, Q., Huang, Y., Lu, W., Liu, X.: AI vs. Human – Diﬀerentiation Analysis of Scientiﬁc Content Generation (20 23). https://arxiv.org/abs/2301.10416

work page arXiv
[48]

In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J

Mitchell, E., Lee, Y., Khazatsky, A., Manning, C.D., Finn, C.: Detect gpt: Zero- shot machine-generated text detection using probability curvatu re. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Hon olulu, Hawaii, USA. Proceedings of Machine...

2023
[49]

In: The Twelfth International Conference on Learning Represen tations (2024)

Yang, X., Cheng, W., Wu, Y., Petzold, L.R., Wang, W.Y., Chen, H.: DNA -GPT: Divergent n-gram analysis for training-free detection of GPT-gen erated text. In: The Twelfth International Conference on Learning Represen tations (2024). https://openreview.net/forum?id=Xlayxj2fWp

2024
[50]

In: Ku, L.-W ., Mar- tins, A., Srikumar, V

Li, Y., Li, Q., Cui, L., Bi, W., Wang, Z., Wang, L., Yang, L., Shi, S., Zhang , Y.: MAGE: Machine-generated text detection in the wild. In: Ku, L.-W ., Mar- tins, A., Srikumar, V. (eds.) Proceedings of the 62nd Annual Meetin g of the Association for Computational Linguistics (Volume 1: Long Pape rs), pp. 36–53. Association for Computational Linguistics, B...

work page doi:10.18653/v1/2024.acl-long.3 2024
[51]

Crossley

KYLE, K., CROSSLEY, S.A.: Measuring syntactic complexity in l2 writ - ing using ﬁne-grained clausal and phrasal indices. The Modern Lan- guage Journal 102(2), 333–349 (2018) https://doi.org/10.1111/modl.12468 https://onlinelibrary.wiley.com/doi/pdf/10.1111/modl.12468

work page doi:10.1111/modl.12468 2018
[52]

Behavior Research Methods 50(3), 1030–1046 (2018) https://doi.org/10.3758/s13428-017-0924-4

Kyle, K., Crossley, S., Berger, C.: The tool for the automatic an alysis of lexical sophistication (taales): version 2.0. Behavior Research Methods 50(3), 1030–1046 (2018) https://doi.org/10.3758/s13428-017-0924-4

work page doi:10.3758/s13428-017-0924-4 2018
[53]

Maddi, A., Miotti, L.: On the peer review reports: does size matte r? Scientomet- rics 129(10), 5893–5913 (2024) https://doi.org/10.1007/s11192-024-04977-6

work page doi:10.1007/s11192-024-04977-6 2024
[54]

In: Vlachos, A., Augen- stein, I

Guo, Y., Shang, G., Rennard, V., Vazirgiannis, M., Clavel, C.: Autom atic analysis of substantiation in scientiﬁc peer reviews. In: Bouamor, H., Pino, J ., Bali, K. (eds.) Findings of the Association for Computational Linguistics: EM NLP 2023, pp. 10198–10216. Association for Computational Linguistics, Sing apore (2023). https://doi.org/10.18653/v1/2023.ﬁ...

work page doi:10.18653/v1/2023 2023
[55]

Scientometrics 129(9), 5805–5813 (2024) https://doi.org/10.1007/s11192-024-05125-w

Oviedo-Garc ´ ıa, M.´A.: The review mills, not just (self-)plagiarism in review reports, but a step further. Scientometrics 129(9), 5805–5813 (2024) https://doi.org/10.1007/s11192-024-05125-w

work page doi:10.1007/s11192-024-05125-w 2024
[56]

F ACETS 9(1), 1–14 (2024) https://doi.org/10.1139/facets-2024-0102

Cooke, S.J., Young, N., Peiman, K.S., Roche, D.G., Clements, J.C., Ka dykalo, A.N., Provencher, J.F., Raghavan, R., DeRosa, M.C., Lennox, R.J., Rob inson Fayek, A., Cristescu, M.E., Murray, S.J., Quinn, J., Cobey, K.D., Browm an, H.I.: A harm reduction approach to improving peer review by acknowledgin g its imper- fections. F ACETS 9(1), 1–14 (2024) https...

work page doi:10.1139/facets-2024-0102 2024
[57]

In: First Conference on Language Modeling (2024)

Liang, W., Zhang, Y., Wu, Z., Lepp, H., Ji, W., Zhao, X., Cao, H., Liu, S ., He, S., Huang, Z., Yang, D., Potts, C., Manning, C.D., Zou, J.Y.: Mapping the increasing use of LLMs in scientiﬁc papers. In: First Conference on Language Modeling (2024). https://openreview.net/forum?id=YX7QnhxESU

2024
[58]

In: Proceedings of the 31st International Conference on Neural Information Processing Sy stems

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., G omez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Sy stems. NIPS’17, pp. 6000–6010. Curran Associates Inc., Red Hook, NY, USA (2017 ) 46

2017