pith. machine review for the scientific record. sign in

arxiv: 2604.19578 · v1 · submitted 2026-04-21 · 💻 cs.CL · cs.AI· cs.DL· cs.IR

Recognition: unknown

Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI

Authors on Pith no claims yet

Pith reviewed 2026-05-10 01:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.DLcs.IR
keywords peer reviewlarge language modelsLLMAI conferencesevaluation aspectslinguistic analysisreview reportsacademic publishing
0
0 comments X

The pith

LLMs have made peer review texts in AI conferences longer and more focused on summaries, while reducing attention to originality and critical reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies changes in peer review reports from major AI conferences since large language models became available. It finds that these reports grew longer and more fluent, with reviewers putting more weight on summaries and surface-level clarity. Meanwhile, comments on deeper issues like originality, replicability, and detailed criticism became less common, especially from reviewers who rate their own confidence lower. This shift is important because peer review aims to improve manuscript quality through careful evaluation, and if the balance moves toward easier surface checks, it could change what research gets published. The analysis relies on measuring word and sentence features plus automatic tagging of what each sentence evaluates in the review.

Core claim

Following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly reviewers with lower confidence score. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.

What carries the argument

Fine-grained automatic annotation of evaluation aspects in individual review sentences, paired with maximum likelihood estimation to identify likely LLM-assisted reports for comparison across time periods.

If this is right

  • Reviews place greater emphasis on summaries and surface-level clarity.
  • Attention to originality, replicability, and nuanced critical reasoning has decreased.
  • Linguistic patterns have become more standardized, especially among lower-confidence reviewers.
  • The informativeness of recommendation signals for paper decisions may be impacted by these focus shifts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pattern holds, AI conferences may need to revise review guidelines to maintain depth in evaluations.
  • Over time, this could affect the overall rigor and direction of published AI research.
  • Researchers could examine whether papers accepted under more LLM-influenced reviews show different long-term impact.
  • Tools to assist reviewers might be designed to prompt for deeper critical analysis rather than just fluency.

Load-bearing premise

The changes in review length, fluency, and evaluative focus result from LLM adoption rather than other concurrent changes in the field or review process.

What would settle it

A study that compares reviews from the same reviewers before and after LLM availability, or directly measures the proportion of LLM-generated content in reviews and correlates it with the observed changes.

read the original abstract

With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is improving the quality of academic manuscripts, such as clarity, originality and other evaluation aspects. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a maximum likelihood estimation method, previously established, to identify review reports that potentially have modified or generated by LLMs. Finally, we assess the impact of evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendation for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly reviewers with lower confidence score. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript examines changes in peer-review reports from top AI conferences following the emergence of LLMs. It analyzes linguistic features (length, word/sentence complexity, fluency), applies automatic sentence-level annotation of evaluation aspects (e.g., summary, clarity, originality, replicability), uses a previously established MLE method to flag potentially LLM-assisted reviews, and assesses how aspect distributions in those reviews relate to recommendation informativeness. The central claim is that post-LLM reviews are longer and more fluent, emphasize summaries and surface-level clarity with more standardized patterns (especially among low-confidence reviewers), while showing reduced attention to deeper dimensions such as originality, replicability, and nuanced critical reasoning.

Significance. If the causal attribution and measurement validity hold, the work would offer a timely fine-grained view of how LLMs may be reshaping evaluative practices in AI peer review, with potential implications for review quality and scientific standards. The sentence-level aspect analysis and integration of linguistic metrics with LLM detection constitute a methodological strength over coarser aggregate studies. The paper would benefit from explicit validation steps to elevate its contribution.

major comments (3)
  1. [§3.3] §3.3 (LLM detection): The MLE-based identification of LLM-assisted reviews is applied without reported validation, accuracy metrics, or robustness checks on peer-review text; because this classification partitions the data for all subsequent contrasts, any systematic bias in detection directly undermines the attribution of shifts to LLM use.
  2. [§4] §4 (aspect annotation): The automatic sentence-level aspect classifier is deployed without human validation, inter-annotator agreement, or error analysis on the peer-review corpus; the reported decline in originality/replicability focus and rise in summary emphasis therefore cannot be distinguished from possible annotation artifacts.
  3. [§5] §5 (results): The before/after design reports directional changes in length, fluency, and aspect distributions but includes no statistical controls or matching for confounders such as reviewer demographics, conference guideline updates, or submission-volume growth; these omissions leave the causal claim load-bearing yet untested.
minor comments (3)
  1. [Abstract] Abstract and §2: The phrase 'reviewers with lower confidence score' is used without defining how confidence is measured or whether it is self-reported by reviewers.
  2. [§3.1] §3.1: The exact formulas or libraries for 'complexity of words and sentences' (e.g., specific readability indices or syntactic metrics) should be stated explicitly rather than left at the level of 'linguistic features'.
  3. [Table 1] Table 1 or equivalent: Sample sizes, period boundaries, and number of reviews per conference should be reported with exact counts to allow assessment of statistical power.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with our responses and indicate the revisions we will incorporate.

read point-by-point responses
  1. Referee: [§3.3] §3.3 (LLM detection): The MLE-based identification of LLM-assisted reviews is applied without reported validation, accuracy metrics, or robustness checks on peer-review text; because this classification partitions the data for all subsequent contrasts, any systematic bias in detection directly undermines the attribution of shifts to LLM use.

    Authors: The MLE detection method follows the previously established and validated approach from prior work on LLM-generated text detection. We will revise §3.3 to explicitly report the original method's validation metrics (e.g., accuracy on benchmark datasets) and add robustness checks tailored to peer-review text, including sensitivity analysis across thresholds and a small human-annotated sample to evaluate performance in this domain. This addresses potential bias concerns while preserving the partitioning for subsequent analyses. revision: yes

  2. Referee: [§4] §4 (aspect annotation): The automatic sentence-level aspect classifier is deployed without human validation, inter-annotator agreement, or error analysis on the peer-review corpus; the reported decline in originality/replicability focus and rise in summary emphasis therefore cannot be distinguished from possible annotation artifacts.

    Authors: We recognize the need for validation of the automatic aspect classifier. We will add a dedicated validation subsection (or appendix) that includes human annotation on a representative sample of review sentences, reporting inter-annotator agreement metrics such as Cohen's kappa and a detailed error analysis. This will help confirm that shifts in aspect distributions reflect genuine changes rather than classifier artifacts. revision: yes

  3. Referee: [§5] §5 (results): The before/after design reports directional changes in length, fluency, and aspect distributions but includes no statistical controls or matching for confounders such as reviewer demographics, conference guideline updates, or submission-volume growth; these omissions leave the causal claim load-bearing yet untested.

    Authors: The study employs an observational before/after design to document temporal shifts post-LLM emergence. We will revise §5 and the discussion to incorporate additional controls for available factors (e.g., conference and year effects) and explicitly discuss potential confounders such as guideline updates and submission growth. We will also temper causal language to emphasize correlational findings and acknowledge limitations where full matching (e.g., for reviewer demographics) is not feasible due to data constraints. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper performs an observational before-after analysis of peer-review texts using standard linguistic metrics (length, fluency, word/sentence complexity) and automatic sentence-level aspect annotation. It additionally applies a previously established MLE method to flag LLM-assisted reports and then compares aspect distributions and recommendation informativeness. No equations, fitted parameters, or derived quantities are defined in terms of the target results themselves; the before-after shifts and LLM-assisted contrasts are computed directly from the data splits and external tools rather than reducing tautologically to inputs. The MLE reference is invoked as an established detection tool rather than a self-derived uniqueness theorem or load-bearing premise that forces the headline findings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the analysis rests on standard NLP feature extraction and a previously published MLE procedure whose assumptions are not restated here.

pith-pipeline@v0.9.0 · 5585 in / 1247 out tokens · 49462 ms · 2026-05-10T01:45:49.521032+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

53 extracted references · 39 canonical work pages · 1 internal anchor

  1. [1]

    Proceedings of the National Academy of Sciences 112(2), 360–365 (2015) https://doi.org/10.1073/pnas.1418218112 40

    Siler, K., Lee, K., Bero, L.: Measuring the effectiveness of scientifi c gatekeep- ing. Proceedings of the National Academy of Sciences 112(2), 360–365 (2015) https://doi.org/10.1073/pnas.1418218112 40

  2. [2]

    Scientometrics 113(1), 633–650 (2017) https://doi.org/10.1007/s11192-017-2310-5

    Huisman, J., Smits, J.: Duration and quality of the peer review pro- cess: the author’s perspective. Scientometrics 113(1), 633–650 (2017) https://doi.org/10.1007/s11192-017-2310-5

  3. [3]

    Scientometrics 130(3), 1547–1569 (2025) https://doi.org/10.1007/s11192-025-05264-8

    Yu, H., Liang, Y., Xie, Y.: Understanding the sustainability of supply–demand in peer review system: an analysis based on scholars ’ research and review activities. Scientometrics 130(3), 1547–1569 (2025) https://doi.org/10.1007/s11192-025-05264-8

  4. [4]

    FEMS Microbiolo gy Letters 365(19), 204 (2018) https://doi.org/10.1093/femsle/fny204

    Tennant, J.P.: The state of the art in peer review. FEMS Microbiolo gy Letters 365(19), 204 (2018) https://doi.org/10.1093/femsle/fny204

  5. [5]

    https://arxiv.org/abs/2106.00810

    Russo, A.: Some Ethical Issues in the Review Process of Machine L earning Conferences (2021). https://arxiv.org/abs/2106.00810

  6. [6]

    https://openreview.net/forum?id=Cn706AbJaKW

    Tran, D., Valtchanov, A.V., Ganapathy, K.R., Feng, R., Slud, E.V., Go ld- blum, M., Goldstein, T.: An Open Review of OpenReview: A Criti- cal Analysis of the Machine Learning Conference Review Process (2 021). https://openreview.net/forum?id=Cn706AbJaKW

  7. [7]

    Stelmakh, I., Shah, N.B., Singh, A., Daum´ e, H.: Prior and prejudice : The novice reviewers’ bias against resubmissions in conference peer review. Proc. ACM Hum.- Comput. Interact. 5(CSCW1) (2021) https://doi.org/10.1145/3449149

  8. [8]

    https://arxiv.org/abs/2211.06398

    Zhang, J., Zhang, H., Deng, Z., Roth, D.: Investigating Fairness D is- parities in Peer Review: A Language Model Enhanced Approach (202 2). https://arxiv.org/abs/2211.06398

  9. [9]

    Functional Ecology 37(5), 1144–1157 (2023) https://doi.org/10.1111/1365-2435.14259

    Fox, C.W., Meyer, J., Aim´ e, E.: Double-blind peer review affects rev iewer ratings and editor decisions at an ecology journal. Functional Ecology 37(5), 1144–1157 (2023) https://doi.org/10.1111/1365-2435.14259

  10. [10]

    In: Proceedings of the 37th International Conference on Neura l Information Pro- cessing Systems

    Lu, Y., Kong, Y.: Calibrating ”cheap signals” in peer review without a prior. In: Proceedings of the 37th International Conference on Neura l Information Pro- cessing Systems. NIPS ’23. Curran Associates Inc., Red Hook, NY, USA (2024). https://doi.org/10.5555/3666122.3667028

  11. [11]

    In: Proceedings of the 37th Int ernational Con- ference on Neural Information Processing Systems

    Xu, Y.E., Jecmen, S., Song, Z., Fang, F.: A one-size-fits-all appr oach to improving randomness in paper assignment. In: Proceedings of the 37th Int ernational Con- ference on Neural Information Processing Systems. NIPS ’23. Cu rran Associates Inc., Red Hook, NY, USA (2024)

  12. [12]

    https://arxiv.org/abs/2310.05966

    Liu, Y., Yang, K., Liu, Y., Drew, M.G.B.: The Shackles of Peer Review: Unveiling the Flaws in the Ivory Tower (2023). https://arxiv.org/abs/2310.05966

  13. [14]

    Yuan, W., Liu, P., Neubig, G.: Can we automate scientific reviewing? J. Artif. Int. Res. 75 (2022) https://doi.org/10.1613/jair.1.12862

  14. [15]

    Proceedings of the AAAI Conference on Artifi cial Intelligence 36(10), 11639–11647 (2022) https://doi.org/10.1609/aaai.v36i10.21418

    Yuan, W., Liu, P.: Kid-review: Knowledge-guided scientific review g eneration with oracle pre-training. Proceedings of the AAAI Conference on Artifi cial Intelligence 36(10), 11639–11647 (2022) https://doi.org/10.1609/aaai.v36i10.21418

  15. [19]

    In: Al-Onaizan, Y., B ansal, M., Chen, Y.-N

    Jin, Y., Zhao, Q., Wang, Y., Chen, H., Zhu, K., Xiao, Y., Wang, J.: Age ntReview: Exploring peer review dynamics with LLM agents. In: Al-Onaizan, Y., B ansal, M., Chen, Y.-N. (eds.) Proceedings of the 2024 Conference on Empir ical Methods in Natural Language Processing, pp. 1208–1226. Association for Computational Linguistics, Miami, Florida, USA (2024). ...

  16. [20]

    BMJ 367 (2019) https://doi.org/10.1136/bmj.l6460

    Barnett, A., Mewburn, I., Schroter, S.: Working 9 to 5, not the way to make an academic living: observational analysis of manuscript and peer revie w submissions over time. BMJ 367 (2019) https://doi.org/10.1136/bmj.l6460

  17. [21]

    Geng, Y., Cao, R., Han, X., Tian, W., Zhang, G., Wang, X.: Scientists are working overtime: when do scientists download scientific papers? Scientome trics 127(11), 6413–6429 (2022) https://doi.org/10.1007/s11192-022-04524-1

  18. [22]

    GPT-4 Technical Report

    OpenAI: GPT-4 Technical Report (2024). https://arxiv.org/abs/2303.08774

  19. [23]

    The Lancet Infectious Diseases 23(7), 781 (2023) https://doi.org/10.1016/S1473-3099(23)00290-6

    Donker, T.: The dangers of using large language models for peer review. The Lancet Infectious Diseases 23(7), 781 (2023) https://doi.org/10.1016/S1473-3099(23)00290-6

  20. [24]

    42 Nature 628(8008), 483–484 (2024) https://doi.org/10.1038/d41586-024-01051-2

    Chawla, D.S.: Is chatgpt corrupting peer review? telltale words h int at ai use. 42 Nature 628(8008), 483–484 (2024) https://doi.org/10.1038/d41586-024-01051-2

  21. [25]

    McFarland, and James Zou

    Liang, W., Zhang, Y., Cao, H., Wang, B., Ding, D.Y., Yang, X., Vodrah alli, K., He, S., Smith, D.S., Yin, Y., McFarland, D.A., Zou, J.: Can large language m od- els provide useful feedback on research papers? a large-scale em pirical analysis. NEJM AI 1(8), 2400196 (2024) https://doi.org/10.1056/AIoa2400196

  22. [26]

    In: Salakhutdinov, R., Kolter, Z., Heller, K., W eller, A., Oliver, N., Scarlett, J., Berkenkamp, F

    Liang, W., Izzo, Z., Zhang, Y., Lepp, H., Cao, H., Zhao, X., Chen, L ., Ye, H., Liu, S., Huang, Z., McFarland, D., Zou, J.Y.: Monitoring AI-modified content at scale: A case study on the impact of ChatGPT on AI con- ference peer reviews. In: Salakhutdinov, R., Kolter, Z., Heller, K., W eller, A., Oliver, N., Scarlett, J., Berkenkamp, F. (eds.) Proceedings ...

  23. [27]

    arXiv preprint arXiv:2405.02150

    Latona, G.R., Ribeiro, M.H., Davidson, T.R., Veselovsky, V., West, R .: The AI Review Lottery: Widespread AI-Assisted Peer Reviews Boost Pape r Scores and Acceptance Rates (2024). https://arxiv.org/abs/2405.02150

  24. [29]

    A Dataset of Peer Reviews ( P eer R ead): Collection, Insights and NLP Applications

    Kang, D., Ammar, W., Dalvi, B., Zuylen, M., Kohlmeier, S., Hovy, E., Sc hwartz, R.: A dataset of peer reviews (PeerRead): Collection, insights and N LP applica- tions. In: Proceedings of the 2018 Conference of the North Amer ican Chapter of the Association for Computational Linguistics: Human Language Te chnologies, Volume 1 (Long Papers), pp. 1647–1661. ...

  25. [30]

    In: Davis, B., Graham, Y., Kelleher, J., Sripada, Y

    Wang, Q., Zeng, Q., Huang, L., Knight, K., Ji, H., Rajani, N.F.: Review Robot: Explainable paper review generation based on knowledge synthesis. In: Davis, B., Graham, Y., Kelleher, J., Sripada, Y. (eds.) Proceedings of the 13th International Conference on Natural Language Gener ation, pp. 384–397. Association for Computational Linguistics, Dublin, Irelan...

  26. [31]

    In: Chandrasekaran, M.K., Waard, A., Feigenblat, G., Freitag, D., Ghosal, T., Hovy, E., Knoth, P., Konopnicki, D., Mayr, P., Patton, R.M., Shmueli-Scheuer, M

    Li, J., Sato, A., Shimura, K., Fukumoto, F.: Multi-task peer-revie w score predic- tion. In: Chandrasekaran, M.K., Waard, A., Feigenblat, G., Freitag, D., Ghosal, T., Hovy, E., Knoth, P., Konopnicki, D., Mayr, P., Patton, R.M., Shmueli-Scheuer, M. (eds.) Proceedings of the First Workshop on Scholarly Document Process- ing, pp. 121–126. Association for Com...

  27. [32]

    43 https://arxiv.org/abs/2502.19614

    Yu, S., Luo, M., Madusu, A., Lal, V., Howard, P.: Is Your Paper Bein g Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review ( 2025). 43 https://arxiv.org/abs/2502.19614

  28. [33]

    https://arxiv.org/abs/2502.11736

    Garg, M.K., Prasad, T., Singhal, T., Kirtani, C., Mandal, M., Kumar, D .: ReviewEval: An Evaluation Framework for AI-Generated Reviews (2 025). https://arxiv.org/abs/2502.11736

  29. [34]

    Information Fusion 124, 103332 (2025) https://doi.org/10.1016/j.inffus.2025.103332

    Zhuang, Z., Chen, J., Xu, H., Jiang, Y., Lin, J.: Large language mod els for auto- mated scholarly paper review: A survey. Information Fusion 124, 103332 (2025) https://doi.org/10.1016/j.inffus.2025.103332

  30. [35]

    https://arxiv.org/abs/2307.05492

    Robertson, Z.: GPT4 is Slightly Helpful for Peer-Review Assistan ce: A Pilot Study (2023). https://arxiv.org/abs/2307.05492

  31. [36]

    https://arxiv.org/abs/2306.00622

    Liu, R., Shah, N.B.: ReviewerGPT? An Exploratory Study on Using L arge Language Models for Paper Reviewing (2023). https://arxiv.org/abs/2306.00622

  32. [37]

    Thelwall, M.: Can chatgpt evaluate research quality? Journal of Data and Information Science 9(2), 1–21 (2024) https://doi.org/10.2478/jdis-2024-0013

  33. [38]

    In: Calzolari, N., Kan , M.-Y., Hoste, V., Lenci, A., Sakti, S., Xue, N

    Zhou, R., Chen, L., Yu, K.: Is LLM a reliable reviewer? a comprehen sive evalua- tion of LLM on automatic paper reviewing tasks. In: Calzolari, N., Kan , M.-Y., Hoste, V., Lenci, A., Sakti, S., Xue, N. (eds.) Proceedings of the 202 4 Joint International Conference on Computational Linguistics, Langua ge Resources and Evaluation (LREC-COLING 2024), pp. 934...

  34. [39]

    In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N

    Du, J., Wang, Y., Zhao, W., Deng, Z., Liu, S., Lou, R., Zou, H.P., Narayanan Venkit, P., Zhang, N., Srinath, M., Zhang, H.R., Gupta, V., Li, Y., Li, T., Wang, F., Liu, Q., Liu, T., Gao, P., Xia, C., Xing, C., Jiayang, C., Wang, Z., Su, Y., Shah, R.S., Guo, R., Gu, J., Li, H., Wei, K., Wang, Z., Chen g, L., Ranathunga, S., Fang, M., Fu, J., Liu, F., Huang,...

  35. [40]

    Alireza Ghafarollahi and Markus J Buehler

    Gao, Z., Brantley, K., Joachims, T.: Reviewer2: Optimizing Review G eneration Through Prompt Generation (2024). https://arxiv.org/abs/2402.10886

  36. [41]

    Qwen Team

    Tan, C., Lyu, D., Li, S., Gao, Z., Wei, J., Ma, S., Liu, Z., Li, S.Z.: Peer Re view as A Multi-Turn and Long-Context Dialogue with Role-Based Interactio ns (2024). https://arxiv.org/abs/2406.05688

  37. [42]

    Marg: Multi-agent review generation for scientific papers.ArXiv, abs/2401.04259,

    D’Arcy, M., Hope, T., Birnbaum, L., Downey, D.: MARG: Multi-Agent Review Generation for Scientific Papers (2024). https://arxiv.org/abs/2401.04259 44

  38. [43]

    In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N

    Yu, J., Ding, Z., Tan, J., Luo, K., Weng, Z., Gong, C., Zeng, L., Cui, R ., Han, C., Sun, Q., Wu, Z., Lan, Y., Li, X.: Automated peer reviewing in paper SEA: Stan- dardization, evaluation, and analysis. In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N. (eds.) Findings of the Association for Computational Linguistics: EM NLP 2024, pp. 10164–10184. Association fo...

  39. [44]

    I n: Editors: Shushanik Sargsyan, Wolfgang Gl¨ anzel, Giovanni Abramo: 20th In ternational Conference On Scientometrics & Informetrics, pp

    Wu, W., Zhang, Y., Haunschild, R., Bornmann, L.: Leveraging large language models for post-publication peer review: Potential and limitations. I n: Editors: Shushanik Sargsyan, Wolfgang Gl¨ anzel, Giovanni Abramo: 20th In ternational Conference On Scientometrics & Informetrics, pp. 23–27 (2025)

  40. [45]

    https://openreview.net/forum?id=0CfKpG94co

    GENG, M., Trotta, R.: Is chatGPT transforming academics’ writ ing style? In: ICML 2024 Next Generation of AI Safety Workshop (202 4). https://openreview.net/forum?id=0CfKpG94co

  41. [46]

    In: Araujo, P.H., Baumann, A., Gromann, D., Krenn, B., Roth, B., Wiegand, M

    Evans, J., D’Souza, J., Auer, S.: Large language models as evalua tors for scien- tific synthesis. In: Araujo, P.H., Baumann, A., Gromann, D., Krenn, B., Roth, B., Wiegand, M. (eds.) Proceedings of the 20th Conference on Natura l Language Pro- cessing (KONVENS 2024), pp. 1–22. Association for Computationa l Linguistics, Vienna, Austria (2024). https://acla...

  42. [47]

    Human – Differentiation Analysis of Scientific Content Generation (20 23)

    Ma, Y., Liu, J., Yi, F., Cheng, Q., Huang, Y., Lu, W., Liu, X.: AI vs. Human – Differentiation Analysis of Scientific Content Generation (20 23). https://arxiv.org/abs/2301.10416

  43. [48]

    In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J

    Mitchell, E., Lee, Y., Khazatsky, A., Manning, C.D., Finn, C.: Detect gpt: Zero- shot machine-generated text detection using probability curvatu re. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Hon olulu, Hawaii, USA. Proceedings of Machine...

  44. [49]

    In: The Twelfth International Conference on Learning Represen tations (2024)

    Yang, X., Cheng, W., Wu, Y., Petzold, L.R., Wang, W.Y., Chen, H.: DNA -GPT: Divergent n-gram analysis for training-free detection of GPT-gen erated text. In: The Twelfth International Conference on Learning Represen tations (2024). https://openreview.net/forum?id=Xlayxj2fWp

  45. [50]

    In: Ku, L.-W ., Mar- tins, A., Srikumar, V

    Li, Y., Li, Q., Cui, L., Bi, W., Wang, Z., Wang, L., Yang, L., Shi, S., Zhang , Y.: MAGE: Machine-generated text detection in the wild. In: Ku, L.-W ., Mar- tins, A., Srikumar, V. (eds.) Proceedings of the 62nd Annual Meetin g of the Association for Computational Linguistics (Volume 1: Long Pape rs), pp. 36–53. Association for Computational Linguistics, B...

  46. [51]

    Crossley

    KYLE, K., CROSSLEY, S.A.: Measuring syntactic complexity in l2 writ - ing using fine-grained clausal and phrasal indices. The Modern Lan- guage Journal 102(2), 333–349 (2018) https://doi.org/10.1111/modl.12468 https://onlinelibrary.wiley.com/doi/pdf/10.1111/modl.12468

  47. [52]

    Behavior Research Methods 50(3), 1030–1046 (2018) https://doi.org/10.3758/s13428-017-0924-4

    Kyle, K., Crossley, S., Berger, C.: The tool for the automatic an alysis of lexical sophistication (taales): version 2.0. Behavior Research Methods 50(3), 1030–1046 (2018) https://doi.org/10.3758/s13428-017-0924-4

  48. [53]

    Maddi, A., Miotti, L.: On the peer review reports: does size matte r? Scientomet- rics 129(10), 5893–5913 (2024) https://doi.org/10.1007/s11192-024-04977-6

  49. [54]

    In: Vlachos, A., Augen- stein, I

    Guo, Y., Shang, G., Rennard, V., Vazirgiannis, M., Clavel, C.: Autom atic analysis of substantiation in scientific peer reviews. In: Bouamor, H., Pino, J ., Bali, K. (eds.) Findings of the Association for Computational Linguistics: EM NLP 2023, pp. 10198–10216. Association for Computational Linguistics, Sing apore (2023). https://doi.org/10.18653/v1/2023.fi...

  50. [55]

    Scientometrics 129(9), 5805–5813 (2024) https://doi.org/10.1007/s11192-024-05125-w

    Oviedo-Garc ´ ıa, M.´A.: The review mills, not just (self-)plagiarism in review reports, but a step further. Scientometrics 129(9), 5805–5813 (2024) https://doi.org/10.1007/s11192-024-05125-w

  51. [56]

    F ACETS 9(1), 1–14 (2024) https://doi.org/10.1139/facets-2024-0102

    Cooke, S.J., Young, N., Peiman, K.S., Roche, D.G., Clements, J.C., Ka dykalo, A.N., Provencher, J.F., Raghavan, R., DeRosa, M.C., Lennox, R.J., Rob inson Fayek, A., Cristescu, M.E., Murray, S.J., Quinn, J., Cobey, K.D., Browm an, H.I.: A harm reduction approach to improving peer review by acknowledgin g its imper- fections. F ACETS 9(1), 1–14 (2024) https...

  52. [57]

    In: First Conference on Language Modeling (2024)

    Liang, W., Zhang, Y., Wu, Z., Lepp, H., Ji, W., Zhao, X., Cao, H., Liu, S ., He, S., Huang, Z., Yang, D., Potts, C., Manning, C.D., Zou, J.Y.: Mapping the increasing use of LLMs in scientific papers. In: First Conference on Language Modeling (2024). https://openreview.net/forum?id=YX7QnhxESU

  53. [58]

    In: Proceedings of the 31st International Conference on Neural Information Processing Sy stems

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., G omez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Sy stems. NIPS’17, pp. 6000–6010. Curran Associates Inc., Red Hook, NY, USA (2017 ) 46