Characterizing Students' LLM Usage Behaviors and Their Association with Learning in Critical Thinking Tasks
Pith reviewed 2026-05-08 16:55 UTC · model grok-4.3
The pith
A refined categorization of how students use LLMs in paper-critique homework, linking usage types and levels of student initiative to midterm performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish a refined bottom-up categorization of LLM usage types drawn from student reports, with each type labeled by the extent of student initiative involved in the interaction. They further examine associations between these categories, overall usage frequency, and performance on midterms that measure critical thinking, using data from two course offerings to extend prior single-offering findings.
What carries the argument
The bottom-up categorization of LLM usage types cross-labeled by the degree of student initiative required in each interaction.
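As an illustration of what such a cross-labeled taxonomy could look like in data form, here is a minimal sketch; the usage-type names and initiative levels below are hypothetical placeholders, not the categories reported in the paper.

```python
# Hypothetical sketch of a usage taxonomy cross-labeled by student initiative.
# The type names and initiative levels are illustrative, not the paper's categories.
from dataclasses import dataclass
from enum import Enum


class Initiative(Enum):
    HIGH = "high"      # student produces the substance; LLM reacts to it
    MEDIUM = "medium"  # student and LLM iterate together
    LOW = "low"        # LLM produces the substance; student mostly accepts it


@dataclass(frozen=True)
class UsageType:
    name: str
    initiative: Initiative


# Each bottom-up usage type carries an initiative label, so responses can be
# grouped either by type or by how much initiative the type preserves.
TAXONOMY = [
    UsageType("revise_own_draft", Initiative.HIGH),
    UsageType("clarify_terminology", Initiative.MEDIUM),
    UsageType("generate_full_critique", Initiative.LOW),
]

by_initiative = {level: [t.name for t in TAXONOMY if t.initiative is level]
                 for level in Initiative}
```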
If this is right
- Usage types that preserve higher student initiative may align with stronger midterm performance in critical thinking tasks.
- Frequency of LLM use by itself may show weaker connections to learning outcomes than the specific type of use.
- Patterns identified across two course offerings indicate repeatable behaviors that can be anticipated in similar settings.
- The initiative dimension offers a practical lever for distinguishing supportive from less supportive LLM practices.
Where Pith is reading between the lines
- Course designers could test prompts or guidelines that steer students toward higher-initiative LLM uses to support critical thinking practice.
- Validation of self-reports against tool logs would strengthen the evidence base for these associations.
- Applying the same initiative-labeled categorization to other subjects like problem-solving could reveal whether the pattern holds beyond paper critique.
Load-bearing premise
That students' self-reported LLM usage practices accurately reflect their actual behaviors on the homework assignments, and that performance on the three midterms validly measures learning gains in critical thinking from those activities.
What would settle it
Direct logging or observation of students' actual LLM interactions during the assignments to check agreement with self-reports, or use of a pre-post assessment specifically targeting paper critique skills instead of relying on midterm scores.
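As a sketch of what that validation could look like, the snippet below computes chance-corrected agreement (Cohen's kappa) between self-reported and logged usage categories; the data and category names are invented for illustration, since the study collected self-reports only.

```python
# Illustrative only: agreement between self-reported and (hypothetical) logged
# usage categories, one label per student-assignment pair. Data are invented.
from collections import Counter


def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two categorical labelings."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    marg_a, marg_b = Counter(a), Counter(b)
    expected = sum(marg_a[c] * marg_b[c] for c in set(a) | set(b)) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)


self_reported = ["revise", "none", "generate", "clarify", "none"]
logged = ["revise", "none", "clarify", "clarify", "none"]
print(f"kappa = {cohen_kappa(self_reported, logged):.2f}")  # high agreement would support the self-reports
```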
Original abstract
Large language models (LLMs) are becoming increasingly embedded in students' learning practices, yet much of what is known about how students use LLMs and how this usage impacts learning comes from problem-solving domains or constrained experimental settings. We present an analysis of data on LLM usage collected during two offerings of a research-oriented course where students learn to read, reason about, and critique academic papers. Without restrictions on whether or how to use LLMs, students reported their LLM usage practices when asked to do these activities as a series of homework assignments during the course. This paper extends prior work done on data from a single offering of the same course by presenting a refined bottom-up categorization of LLM usage types, cross-labeled by the extent of student initiative these usages entail. Furthermore, we examine how LLM use impacts student learning, measured by performance on three midterms, looking at factors such as frequency and type of usage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes self-reported LLM usage data collected from students in two offerings of a research-oriented course on reading, reasoning about, and critiquing academic papers. It refines a bottom-up taxonomy of LLM usage types cross-labeled by the level of student initiative, and examines associations between usage frequency, type, and performance on three midterms as a proxy for learning gains in critical thinking skills.
Significance. If the self-report data and midterm associations prove reliable, the work offers a practically useful taxonomy for understanding LLM integration in open-ended critical thinking tasks and extends single-offering findings to multiple course instances. This could inform pedagogical guidelines in HCI and education research on AI-assisted learning.
Major comments (2)
- [Methods] Methods section on data collection: The central taxonomy and associations rest on retrospective self-reports of LLM usage and initiative without apparent triangulation against platform logs, browser histories, or think-aloud data; this directly weakens the claim that the refined categorization accurately reflects behaviors during the homework assignments.
- [Results/Analysis] Section describing learning outcomes and analysis: Midterm performance is used to measure learning gains from the LLM-supported homework without reported controls for prior knowledge, general academic ability, pre/post design, or motivation; this makes it difficult to isolate the impact of usage frequency and type on critical-thinking skill development.
Minor comments (1)
- [Abstract] The abstract would benefit from including sample sizes, key statistical methods, and a brief summary of main findings to allow readers to assess the strength of the reported associations.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments. We address each major comment below and outline the revisions we will make to the manuscript.
Point-by-point responses
Referee: [Methods] Methods section on data collection: The central taxonomy and associations rest on retrospective self-reports of LLM usage and initiative without apparent triangulation against platform logs, browser histories, or think-aloud data; this directly weakens the claim that the refined categorization accurately reflects behaviors during the homework assignments.
Authors: We recognize the limitation of relying solely on retrospective self-reports without additional triangulation methods such as platform logs or think-aloud protocols. Our taxonomy was developed bottom-up from the self-reported data, and the initiative labels were assigned based on students' descriptions of their usage. To address this, we will revise the Methods and Limitations sections to more explicitly acknowledge the potential discrepancies between reported and actual behaviors, and we will moderate our language regarding how accurately the categorization reflects in-situ behaviors. We will also suggest in the discussion that future studies could incorporate logging for validation.
Revision: partial
Referee: [Results/Analysis] Section describing learning outcomes and analysis: Midterm performance is used to measure learning gains from the LLM-supported homework without reported controls for prior knowledge, general academic ability, pre/post design, or motivation; this makes it difficult to isolate the impact of usage frequency and type on critical-thinking skill development.
Authors: We agree that our study design does not allow for causal inferences about the impact of LLM usage on learning gains. The analysis focuses on associations between usage patterns and midterm performance as a proxy for critical thinking skills. We did not collect pre-course measures of prior knowledge or motivation, nor did we use a pre/post design. We will revise the Results, Analysis, and Discussion sections to clearly frame our findings as correlational associations rather than evidence of learning gains attributable to specific usage types. We will also expand the limitations section to highlight these design constraints and the need for controlled studies in future work.
Revision: yes
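For readers unfamiliar with the correlational framing being committed to here, a minimal sketch of such an association test follows; the usage and score values are invented, and the snippet is not the authors' actual analysis.

```python
# Illustrative association test, not the authors' analysis: rank correlation
# between self-reported LLM usage frequency and midterm score (invented data).
from scipy.stats import spearmanr

usage_fraction = [0.00, 0.05, 0.10, 0.20, 0.33, 0.50]  # share of assignments with LLM use
midterm_score = [85, 78, 82, 74, 70, 65]                # out of 100

rho, p_value = spearmanr(usage_fraction, midterm_score)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")  # reports association only, no causal claim
```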
Circularity Check
No circularity: purely observational empirical analysis
Full rationale
The paper describes an observational study collecting self-reported LLM usage from two course offerings, applying bottom-up categorization to the reports, and correlating usage frequency and types with existing midterm grades as a learning measure. No equations, derivations, fitted parameters, or predictive models are referenced that could reduce to inputs by construction. The abstract and structure indicate standard empirical data analysis without self-definitional loops, fitted-input predictions, or load-bearing self-citations that would force the central claims. The analysis chain is self-contained: data collection followed by association testing.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Self-reported LLM usage practices accurately reflect students' actual behaviors during homework assignments.
- Domain assumption: Midterm performance validly measures learning in critical thinking skills developed through the homework activities.
Reference graph
Sections of the reviewed paper (internal anchors)
Introduction; Related Work; Data Overview; Methods; Results; Discussion; Conclusion; Acknowledgments. Details recoverable from the anchor previews: the introduction situates the work in the ongoing debate over what roles LLMs can and should play in education [21, 13, 11]; the study was conducted in a third-year undergraduate course at a public North American institution; Term 1 and Term 2 cohorts were compared on self-reported coding ability and disciplinary background and judged similar enough to analyze jointly; one reported student prompt was "Here is my summary for the attached paper: [student's draft] Revise it for any grammatical issues"; only a minority of students (23 out of 68; 34%) used LLMs for the course assignments, and most of those (16 students) used them only lightly, with reliance in the approximate 5–33% range; the work was funded by UBC Skylight grant PM015111 and complies with UBC policies 81 and LR11 on ownership and use of course material.
Works this paper leans on
- [9] J. Brender, L. El-Hamamsy, F. Mondada, and E. Bumbacher. Who's Helping Who? When Students Use ChatGPT to Engage in Practice Lab Sessions. In International Conference on Artificial Intelligence in Education, pages 235–249. Springer, 2024.
- [10] C. Bull and A. Kharrufa. Generative Artificial Intelligence Assistants in Software Development Education: A Vision for Integrating Generative Artificial Intelligence into Educational Practice, Not Instinctively Defending Against It. IEEE Software, 41(2):52–59, 2023.
- [11] L. Flower and J. R. Hayes. A Cognitive Process Theory of Writing. College Composition and Communication, 32(4):365–387, 1981.
- [12] J. Gao, S. A. Gebreegziabher, K. T. W. Choo, T. J.-J. Li, S. T. Perrault, and T. W. Malone. A Taxonomy for Human-LLM Interaction Modes: An Initial Exploration. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, pages 1–11, 2024.
- [13] A. Ghimire and J. Edwards. Coding with AI: How Are Tools Like ChatGPT Being Used by Students in Foundational Programming Courses. In International Conference on Artificial Intelligence in Education, pages 259–267. Springer, 2024.
- [14] L. Giray. Prompt Engineering with ChatGPT: A Guide for Academic Writers. Annals of Biomedical Engineering, 51(12):2629–2633, 2023.
- [15] V. Grande, N. Kiesler, and M. A. Francisco R. Student Perspectives on Using a Large Language Model (LLM) for an Assignment on Professional Ethics. In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1, pages 478–484. Association for Computing Machinery, 2024.
- [16] A. Jelson, D. Manesh, A. Jang, D. Dunlap, Y.-H. Kim, and S. W. Lee. An Empirical Study to Understand How Students Use ChatGPT for Writing Essays. arXiv preprint arXiv:2501.10551, 2025.
- [17] M. Kazemitabaar, X. Hou, A. Henley, B. J. Ericson, D. Weintrop, and T. Grossman. How Novices Use LLM-based Code Generators to Solve CS1 Coding Tasks in a Self-Paced Learning Environment. In Proceedings of the 23rd Koli Calling International Conference on Computing Education Research, pages 1–12, 2023.
- [18] J. Kim, S. Yu, S.-S. Lee, and R. Detrick. Students' Prompt Patterns and Its Effects in AI-Assisted Academic Writing: Focusing on Students' Level of AI Literacy. Journal of Research on Technology in Education, 58(3):638–655, 2026.
- [19] S. Lau and P. Guo. From "Ban It Till We Understand It" to "Resistance is Futile": How University Programming Instructors Plan to Adapt as More Students Use AI Code Generation and Explanation Tools such as ChatGPT and GitHub Copilot. In Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1, pages 106–121, 2023.
- [20] B. Ma, L. Chen, and S. Konomi. Examining Student-ChatGPT Interactions in Programming Education, pages 97–114. Springer Nature Switzerland, Cham, 2025.
- [21] S. MacNeil, A. Tran, A. Hellas, J. Kim, S. Sarsa, P. Denny, S. Bernstein, and J. Leinonen. Experiences from Using Code Explanations Generated by Large Language Models in a Web Software Development E-Book. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, pages 931–937, 2023.
- [22] E. Mollick and L. Mollick. Assigning AI: Seven Approaches for Students, with Prompts. arXiv preprint arXiv:2306.10052, 2023.
- [23] I. Orozco Vasquez, R. Mahinpei, N. Elouazizi, and C. Conati. An Emergent Bottom-Up Categorization of Students' LLMs Usage in an Undergraduate Research Course. In International Conference on Artificial Intelligence in Education, pages 133–139. Springer, 2025.
- [24] S. Rasnayaka, G. Wang, R. Shariffdeen, and G. N. Iyer. An Empirical Study on Usage and Perceptions of LLMs in a Software Engineering Project. In Proceedings of the 1st International Workshop on Large Language Models for Code, pages 111–118, 2024.
- [25]
- [26] G. Sawalha, I. Taj, and A. Shoufan. Analyzing Student Prompts and Their Effect on ChatGPT's Performance. Cogent Education, 11(1):2397200, 2024.
- [27] S. Schulhoff, M. Ilie, N. Balepur, K. Kahadze, A. Liu, C. Si, Y. Li, A. Gupta, H. Han, S. Schulhoff, et al. The Prompt Report: A Systematic Survey of Prompt Engineering Techniques. arXiv preprint arXiv:2406.06608, 2024.
- [28] D. Sun, A. Boudouaia, J. Yang, and J. Xu. Investigating Students' Programming Behaviors, Interaction Qualities and Perceptions through Prompt-Based Learning in ChatGPT. Humanities and Social Sciences Communications, 11(1):1–14, 2024.
- [29] A. Vadaparty, D. Zingaro, D. H. Smith IV, M. Padala, C. Alvarado, J. Gorson Benario, and L. Porter. CS1-LLM: Integrating LLMs into CS1 Instruction. In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1, pages 297–303. Association for Computing Machinery, 2024.