pith. machine review for the scientific record.

arxiv: 2604.15460 · v1 · submitted 2026-04-16 · 💻 cs.HC · cs.AI

Recognition: unknown

The Crutch or the Ceiling? How Different Generations of LLMs Shape EFL Student Writings

Chi Ho Yeung, Chingyi Yeung, David James Woo, Hengky Susanto, Stephanie Wing Yan Lo-Philip


Pith reviewed 2026-05-10 09:50 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords LLM assistance · EFL writing · ChatGPT · writing coherence · scaffolding · student proficiency · AI in education

The pith

Advanced LLMs boost EFL writing scores and lexical diversity yet correlate with lower expert coherence ratings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares secondary EFL student compositions assisted by LLMs before and after ChatGPT's release to test whether newer models serve as genuine scaffolds or compensatory crutches. It reports that advanced LLMs raise assessment scores and lexical diversity, especially for lower-proficiency writers, while greater LLM assistance shows a negative correlation with human expert ratings of deep coherence. A sympathetic reader would care because the pattern implies students may gain surface fluency without corresponding gains in independent thinking or structure. The authors conclude that pedagogy must shift from evaluating output quality alone to verifying the underlying learning process, distinguishing ideational support from full textual production within each learner's Zone of Proximal Development.

Core claim

Post-ChatGPT LLMs enhance quantitative measures such as lexical diversity and readability scores for EFL writers, particularly lower-proficiency learners, while increased LLM assistance correlates negatively with qualitative expert ratings, indicating surface fluency without deep coherence. Pedagogy must therefore differentiate ideational scaffolding from textual production and align AI functions with the learner's Zone of Proximal Development.
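The readability scores driving the quantitative side of this claim are standard formulas over surface counts. As a minimal sketch, not the paper's code, two of the indices it uses (ARI and Flesch-Kincaid Grade Level, per its Figure 3) reduce to a few ratios; the sentence splitting and syllable counting here are deliberately naive, where real tools use stronger heuristics:

```python
import re

def ari(text):
    # Automated Readability Index: 4.71*(chars/words) + 0.5*(words/sentences) - 21.43
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43

def naive_syllables(word):
    # crude vowel-group count; production tools use dictionaries or better heuristics
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    # Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(naive_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59
```

Both formulas are grade-level estimates: short common words in short sentences score low (even below zero), while longer, rarer vocabulary pushes the score up — which is why AI-polished text can shift these metrics without any change in the student's underlying ability.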

What carries the argument

Comparison of pre- and post-ChatGPT student compositions through expert qualitative scoring, together with quantitative metrics including readability tests, MTLD (Measure of Textual Lexical Diversity), and Pearson's correlation coefficient.
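The load-bearing statistic is Pearson's r between the amount of LLM assistance and expert ratings. A minimal sketch of that computation — the variable names and data below are hypothetical illustrations, not the paper's:

```python
import math

def pearson_r(x, y):
    # Pearson's correlation coefficient between two equal-length samples
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# hypothetical per-essay data: AI-generated word count vs. expert coherence rating
ai_words = [20, 80, 150, 220, 300]
coherence = [4.5, 4.0, 3.6, 3.1, 2.8]
r = pearson_r(ai_words, coherence)  # strongly negative for this toy data
```

An r near -1 on data like this is what the paper's "more assistance, lower coherence ratings" finding looks like numerically; correlation alone, of course, cannot separate the causal readings the referee report questions below.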

If this is right

  • Lower-proficiency EFL learners receive measurable boosts in assessment scores and lexical diversity from advanced LLMs.
  • Greater LLM assistance can mask students' true current ability by supplying surface-level fluency.
  • Human expert ratings of coherence decline as LLM assistance increases, pointing to limits in deep understanding.
  • Effective use requires pedagogy to verify the learning process rather than judge final output quality alone.
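The lexical-diversity gains above are measured with MTLD (McCarthy 2005). A from-scratch sketch of the standard algorithm — count how many times the running type-token ratio decays to the conventional 0.72 threshold, average forward and backward passes — assuming the paper used the usual formulation (its references point to the `lexical-diversity` package):

```python
def _mtld_one_pass(tokens, threshold=0.72):
    # count "factors": segments whose running type-token ratio falls to the threshold
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            factors += 1
            types, count = set(), 0
    if count > 0:  # credit the leftover segment as a partial factor
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(tokens) / factors if factors > 0 else float(len(tokens))

def mtld(tokens, threshold=0.72):
    # MTLD averages a forward and a backward pass over the token sequence
    return (_mtld_one_pass(tokens, threshold)
            + _mtld_one_pass(tokens[::-1], threshold)) / 2
```

Maximally repetitive text scores near the minimum (each factor closes after a couple of tokens), while text with no repeated words never closes a factor and scores its own length — so a jump in MTLD means vocabulary is being recycled less often, whoever supplied it.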

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Without explicit checks on the writing process, repeated LLM use may slow the development of independent coherence skills over a school year.
  • The same surface-versus-depth pattern could appear in other language tasks or subjects where AI supplies polished output.
  • A longitudinal study tracking the same students' unaided writing proficiency across varying levels of permitted LLM support would test whether the masking effect persists.

Load-bearing premise

Observed differences in student writings can be attributed primarily to changes in LLM capabilities rather than shifts in teaching practices, assignment design, or student proficiency over the same period.

What would settle it

A controlled comparison in which post-ChatGPT writings show stable or higher expert coherence ratings when teaching methods and student cohorts are held constant would falsify the claim that increased LLM assistance causes reduced deep coherence.

Figures

Figures reproduced from arXiv: 2604.15460 by Chi Ho Yeung, Chingyi Yeung, David James Woo, Hengky Susanto, Stephanie Wing Yan Lo-Philip.

Figure 1: Performance of students who received assistance from an early generation of LLMs (EarlyGen-LLM) and more advanced LLMs. [PITH_FULL_IMAGE:figures/full_fig_p007_1.png]
Figure 2: Improvement of more advanced LLMs over early-generation LLMs. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png]
Figure 3: Readability tests: Automated Readability Index (ARI), Coleman-Liau Index, Flesch-Kincaid Grade Level, Dale-Chall. [PITH_FULL_IMAGE:figures/full_fig_p010_3.png]
Figure 4: CLO scores sorted by the number of AI-generated texts integrated into the writing. [PITH_FULL_IMAGE:figures/full_fig_p011_4.png]
Figure 5: Lexical analysis based on the sorted CLO scores. [PITH_FULL_IMAGE:figures/full_fig_p011_5.png]
Figure 6: Pearson correlation between readability tests and human-/AI-generated texts. [PITH_FULL_IMAGE:figures/full_fig_p012_6.png]
Figure 7: Correlation between the number of AI- and human-generated words and C, L, and O scores. [PITH_FULL_IMAGE:figures/full_fig_p014_7.png]
Figure 8: Assessment rubric. [PITH_FULL_IMAGE:figures/full_fig_p022_8.png]
Original abstract

The rapid evolution of Large Language Models (LLMs) has made them powerful tools for enhancing student writing. This study explores the extent and limitations of LLMs in assisting secondary-level English as a Foreign Language (EFL) students with their writing tasks. While existing studies focus on output quality, our research examines the developmental shift in LLMs and their impact on EFL students, assessing whether smarter models act as true scaffolds or mere compensatory crutches. To achieve this, we analyse student compositions assisted by LLMs before and after ChatGPT's release, using both expert qualitative scoring and quantitative metrics (readability tests, Pearson's correlation coefficient, MTLD, and others). Our results indicate that advanced LLMs boost assessment scores and lexical diversity for lower-proficiency learners, potentially masking their true ability. Crucially, increased LLM assistance correlated negatively with human expert ratings, suggesting surface fluency without deep coherence. To transform AI-assisted practice into genuine learning, pedagogy must shift from focusing on output quality to verifying the learning process. Educators should align AI functions, specifically differentiating ideational scaffolding from textual production, within the learner's Zone of Proximal Development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper compares EFL secondary students' compositions assisted by pre-ChatGPT LLMs versus post-ChatGPT models. It employs expert qualitative scoring alongside quantitative metrics (readability tests, MTLD for lexical diversity, Pearson's correlation) to claim that advanced LLMs raise assessment scores and lexical diversity for lower-proficiency learners while increased LLM assistance negatively correlates with human expert ratings, interpreted as surface fluency without deep coherence. The authors conclude that pedagogy should shift from output quality to verifying the learning process and aligning AI use with the Zone of Proximal Development.

Significance. If the empirical claims survive methodological scrutiny, the work could usefully inform EFL pedagogy and HCI research on generative AI in education by documenting how model capability interacts with learner proficiency. The combination of qualitative expert judgment and quantitative measures (MTLD, readability) is a positive feature that allows triangulation of surface versus deeper writing qualities.

major comments (3)
  1. [Methods] Methods section: No sample size, participant demographics, number of compositions, or details on how 'LLM assistance levels' were quantified (self-report, usage logs, or otherwise) are reported. Without these, the negative correlation between assistance and expert ratings cannot be evaluated for statistical power or generalizability.
  2. [Results] Results/Discussion: The pre-post ChatGPT design attributes differences in writing to LLM generations, yet the manuscript provides no controls, matching, or covariates for concurrent changes in curriculum, assignment design, teacher practices, or student cohort proficiency. This leaves the central causal interpretation vulnerable to confounding.
  3. [Abstract] Abstract and Results: The claim that advanced LLMs 'mask true ability' for lower-proficiency learners rests on observed score boosts, but the paper does not report how proficiency was independently measured or how assistance was isolated from learner effort, undermining the masking interpretation.
minor comments (1)
  1. [Abstract] The abstract mentions 'Pearson's correlation coefficient' and 'MTLD' without defining the exact variables correlated or the MTLD implementation details; a brief methods paragraph would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their detailed feedback on our manuscript. We have carefully considered each major comment and provide our responses below, indicating the revisions we plan to make.

Point-by-point responses
  1. Referee: [Methods] Methods section: No sample size, participant demographics, number of compositions, or details on how 'LLM assistance levels' were quantified (self-report, usage logs, or otherwise) are reported. Without these, the negative correlation between assistance and expert ratings cannot be evaluated for statistical power or generalizability.

    Authors: We acknowledge the need for greater transparency in the Methods section. The revised manuscript will include detailed reporting of the sample size, participant demographics (including age, gender distribution, and EFL proficiency levels), the number of compositions analyzed, and the method for quantifying LLM assistance levels, which combined self-reported usage with analysis of writing process logs. These additions will support evaluation of statistical power and generalizability. revision: yes

  2. Referee: [Results] Results/Discussion: The pre-post ChatGPT design attributes differences in writing to LLM generations, yet the manuscript provides no controls, matching, or covariates for concurrent changes in curriculum, assignment design, teacher practices, or student cohort proficiency. This leaves the central causal interpretation vulnerable to confounding.

    Authors: We recognize that the pre-post design is susceptible to confounding from external factors. Our analysis did include student proficiency as a covariate and focused on comparative patterns across LLM generations. In the revision, we will expand the Discussion to explicitly address potential confounders, include any sensitivity analyses, and temper the causal language while emphasizing the observational nature of the findings and their implications for pedagogy. revision: partial

  3. Referee: [Abstract] Abstract and Results: The claim that advanced LLMs 'mask true ability' for lower-proficiency learners rests on observed score boosts, but the paper does not report how proficiency was independently measured or how assistance was isolated from learner effort, undermining the masking interpretation.

    Authors: Proficiency was independently measured using standardized EFL assessment tools prior to the writing tasks, and LLM assistance was isolated through a mixed-methods approach involving usage frequency reports and qualitative differentiation of text features. The masking interpretation is further supported by the negative correlation with expert ratings on deep coherence rather than surface features. We will update the abstract and results sections to clearly describe these measurement procedures and refine the interpretation to be more precise. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical observational study without self-referential derivations

Full rationale

The paper presents an observational pre-post analysis of EFL student writings assisted by different LLM generations, relying on expert qualitative scoring, readability metrics, MTLD, and Pearson correlations. No equations, fitted parameters renamed as predictions, or derivation chains appear in the provided text or abstract. The central claim (negative correlation between LLM assistance level and expert ratings) is framed as an empirical finding rather than a mathematical reduction to inputs. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz. This matches the default expectation for non-derivational empirical work; the design limitations noted in the skeptic take concern external validity and confounding, not circularity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The interpretation that negative correlation indicates 'masking of true ability' assumes expert ratings validly measure deep coherence and that LLM assistance level can be isolated from other factors; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Expert qualitative scoring reliably distinguishes surface fluency from deep coherence in student writing.
    Invoked when linking lower expert ratings to lack of genuine learning.
  • domain assumption Pre- and post-ChatGPT student compositions are comparable after controlling for other variables.
    Required for attributing changes to LLM generations.

pith-pipeline@v0.9.0 · 5516 in / 1254 out tokens · 31800 ms · 2026-05-10T09:50:15.039590+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

83 extracted references · 49 canonical work pages · 5 internal anchors
