Characterizing Students' LLM Usage Behaviors and Their Association with Learning in Critical Thinking Tasks
Pith reviewed 2026-05-08 16:55 UTC · model grok-4.3
The pith
A refined categorization of how students use LLMs in paper-critique homework, linking usage types and levels of student initiative to midterm performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish a refined bottom-up categorization of LLM usage types drawn from student reports, with each type labeled by the extent of student initiative involved in the interaction. They further examine associations between these categories, overall usage frequency, and performance on midterms that measure critical thinking, using data from two course offerings to extend prior single-offering findings.
What carries the argument
The bottom-up categorization of LLM usage types cross-labeled by the degree of student initiative required in each interaction.
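As an illustration of what such a cross-labeled taxonomy could look like in data form, here is a minimal sketch; the usage-type names and initiative levels below are hypothetical placeholders, not the categories reported in the paper.

```python
# Hypothetical sketch of a usage taxonomy cross-labeled by student initiative.
# The type names and initiative levels are illustrative, not the paper's categories.
from dataclasses import dataclass
from enum import Enum


class Initiative(Enum):
    HIGH = "high"      # student produces the substance; LLM reacts to it
    MEDIUM = "medium"  # student and LLM iterate together
    LOW = "low"        # LLM produces the substance; student mostly accepts it


@dataclass(frozen=True)
class UsageType:
    name: str
    initiative: Initiative


# Each bottom-up usage type carries an initiative label, so responses can be
# grouped either by type or by how much initiative the type preserves.
TAXONOMY = [
    UsageType("revise_own_draft", Initiative.HIGH),
    UsageType("clarify_terminology", Initiative.MEDIUM),
    UsageType("generate_full_critique", Initiative.LOW),
]

by_initiative = {level: [t.name for t in TAXONOMY if t.initiative is level]
                 for level in Initiative}
```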
If this is right
- Usage types that preserve higher student initiative may align with stronger midterm performance in critical thinking tasks.
- Frequency of LLM use by itself may show weaker connections to learning outcomes than the specific type of use.
- Patterns identified across two course offerings indicate repeatable behaviors that can be anticipated in similar settings.
- The initiative dimension offers a practical lever for distinguishing supportive from less supportive LLM practices.
Where Pith is reading between the lines
- Course designers could test prompts or guidelines that steer students toward higher-initiative LLM uses to support critical thinking practice.
- Validation of self-reports against tool logs would strengthen the evidence base for these associations.
- Applying the same initiative-labeled categorization to other subjects like problem-solving could reveal whether the pattern holds beyond paper critique.
Load-bearing premise
That students' self-reported LLM usage practices accurately reflect their actual behaviors on the homework assignments, and that performance on the three midterms validly measures learning gains in critical thinking from those activities.
What would settle it
Direct logging or observation of students' actual LLM interactions during the assignments to check agreement with self-reports, or use of a pre-post assessment specifically targeting paper critique skills instead of relying on midterm scores.
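As a sketch of what that validation could look like, the snippet below computes chance-corrected agreement (Cohen's kappa) between self-reported and logged usage categories; the data and category names are invented for illustration, since the study collected self-reports only.

```python
# Illustrative only: agreement between self-reported and (hypothetical) logged
# usage categories, one label per student-assignment pair. Data are invented.
from collections import Counter


def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two categorical labelings."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    marg_a, marg_b = Counter(a), Counter(b)
    expected = sum(marg_a[c] * marg_b[c] for c in set(a) | set(b)) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)


self_reported = ["revise", "none", "generate", "clarify", "none"]
logged = ["revise", "none", "clarify", "clarify", "none"]
print(f"kappa = {cohen_kappa(self_reported, logged):.2f}")  # high agreement would support the self-reports
```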
Original abstract
Large language models (LLMs) are becoming increasingly embedded in students' learning practices, yet much of what is known about how students use LLMs and how this usage impacts learning comes from problem-solving domains or constrained experimental settings. We present an analysis of data on LLM usage collected during two offerings of a research-oriented course where students learn to read, reason about, and critique academic papers. Without restrictions on whether or how to use LLMs, students reported their LLM usage practices when asked to do these activities as a series of homework assignments during the course. This paper extends prior work done on data from a single offering of the same course by presenting a refined bottom-up categorization of LLM usage types, cross-labeled by the extent of student initiative these usages entail. Furthermore, we examine how LLM use impacts student learning, measured by performance on three midterms, looking at factors such as frequency and type of usage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes self-reported LLM usage data collected from students in two offerings of a research-oriented course on reading, reasoning about, and critiquing academic papers. It refines a bottom-up taxonomy of LLM usage types cross-labeled by the level of student initiative, and examines associations between usage frequency, type, and performance on three midterms as a proxy for learning gains in critical thinking skills.
Significance. If the self-report data and midterm associations prove reliable, the work offers a practically useful taxonomy for understanding LLM integration in open-ended critical thinking tasks and extends single-offering findings to multiple course instances. This could inform pedagogical guidelines in HCI and education research on AI-assisted learning.
Major comments (2)
- [Methods] Methods section on data collection: The central taxonomy and associations rest on retrospective self-reports of LLM usage and initiative without apparent triangulation against platform logs, browser histories, or think-aloud data; this directly weakens the claim that the refined categorization accurately reflects behaviors during the homework assignments.
- [Results/Analysis] Section describing learning outcomes and analysis: Midterm performance is used to measure learning gains from the LLM-supported homework without reported controls for prior knowledge, general academic ability, pre/post design, or motivation; this makes it difficult to isolate the impact of usage frequency and type on critical-thinking skill development.
Minor comments (1)
- [Abstract] The abstract would benefit from including sample sizes, key statistical methods, and a brief summary of main findings to allow readers to assess the strength of the reported associations.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments. We address each major comment below and outline the revisions we will make to the manuscript.
Point-by-point responses
Referee: [Methods] Methods section on data collection: The central taxonomy and associations rest on retrospective self-reports of LLM usage and initiative without apparent triangulation against platform logs, browser histories, or think-aloud data; this directly weakens the claim that the refined categorization accurately reflects behaviors during the homework assignments.
Authors: We recognize the limitation of relying solely on retrospective self-reports without additional triangulation methods such as platform logs or think-aloud protocols. Our taxonomy was developed bottom-up from the self-reported data, and the initiative labels were assigned based on students' descriptions of their usage. To address this, we will revise the Methods and Limitations sections to more explicitly acknowledge the potential discrepancies between reported and actual behaviors, and we will moderate our language regarding how accurately the categorization reflects in-situ behaviors. We will also suggest in the discussion that future studies could incorporate logging for validation.
Revision: partial
Referee: [Results/Analysis] Section describing learning outcomes and analysis: Midterm performance is used to measure learning gains from the LLM-supported homework without reported controls for prior knowledge, general academic ability, pre/post design, or motivation; this makes it difficult to isolate the impact of usage frequency and type on critical-thinking skill development.
Authors: We agree that our study design does not allow for causal inferences about the impact of LLM usage on learning gains. The analysis focuses on associations between usage patterns and midterm performance as a proxy for critical thinking skills. We did not collect pre-course measures of prior knowledge or motivation, nor did we use a pre/post design. We will revise the Results, Analysis, and Discussion sections to clearly frame our findings as correlational associations rather than evidence of learning gains attributable to specific usage types. We will also expand the limitations section to highlight these design constraints and the need for controlled studies in future work.
Revision: yes
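For readers unfamiliar with the correlational framing being committed to here, a minimal sketch of such an association test follows; the usage and score values are invented, and the snippet is not the authors' actual analysis.

```python
# Illustrative association test, not the authors' analysis: rank correlation
# between self-reported LLM usage frequency and midterm score (invented data).
from scipy.stats import spearmanr

usage_fraction = [0.00, 0.05, 0.10, 0.20, 0.33, 0.50]  # share of assignments with LLM use
midterm_score = [85, 78, 82, 74, 70, 65]                # out of 100

rho, p_value = spearmanr(usage_fraction, midterm_score)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")  # reports association only, no causal claim
```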
Circularity Check
No circularity: purely observational empirical analysis
Full rationale
The paper describes an observational study collecting self-reported LLM usage from two course offerings, applying bottom-up categorization to the reports, and correlating usage frequency and types with existing midterm grades as a learning measure. No equations, derivations, fitted parameters, or predictive models are referenced that could reduce to inputs by construction. The abstract and structure indicate standard empirical data analysis without self-definitional loops, fitted-input predictions, or load-bearing self-citations that would force the central claims. The analysis chain is self-contained: data collection followed by association testing.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Self-reported LLM usage practices accurately reflect students' actual behaviors during homework assignments.
- Domain assumption: Midterm performance validly measures learning in critical thinking skills developed through the homework activities.
Reference graph
Sections of the reviewed paper (internal anchors)
Introduction; Related Work; Data Overview; Methods; Results; Discussion; Conclusion; Acknowledgments. Details recoverable from the anchor previews: the introduction situates the work in the ongoing debate over what roles LLMs can and should play in education [21, 13, 11]; the study was conducted in a third-year undergraduate course at a public North American institution; Term 1 and Term 2 cohorts were compared on self-reported coding ability and disciplinary background and judged similar enough to analyze jointly; one reported student prompt was "Here is my summary for the attached paper: [student's draft] Revise it for any grammatical issues"; only a minority of students (23 out of 68; 34%) used LLMs for the course assignments, and most of those (16 students) used them only lightly, with reliance in the approximate 5–33% range; the work was funded by UBC Skylight grant PM015111 and complies with UBC policies 81 and LR11 on ownership and use of course material.
Works this paper leans on
- [9] J. Brender, L. El-Hamamsy, F. Mondada, and E. Bumbacher. Who's Helping Who? When Students Use ChatGPT to Engage in Practice Lab Sessions. In International Conference on Artificial Intelligence in Education, pages 235–249. Springer, 2024.
- [10] C. Bull and A. Kharrufa. Generative Artificial Intelligence Assistants in Software Development Education: A Vision for Integrating Generative Artificial Intelligence into Educational Practice, Not Instinctively Defending Against It. IEEE Software, 41(2):52–59, 2023.
- [11] L. Flower and J. R. Hayes. A Cognitive Process Theory of Writing. College Composition and Communication, 32(4):365–387, 1981.
- [12] J. Gao, S. A. Gebreegziabher, K. T. W. Choo, T. J.-J. Li, S. T. Perrault, and T. W. Malone. A Taxonomy for Human-LLM Interaction Modes: An Initial Exploration. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, pages 1–11, 2024.
- [13] A. Ghimire and J. Edwards. Coding with AI: How Are Tools Like ChatGPT Being Used by Students in Foundational Programming Courses. In International Conference on Artificial Intelligence in Education, pages 259–267. Springer, 2024.
- [14] L. Giray. Prompt Engineering with ChatGPT: A Guide for Academic Writers. Annals of Biomedical Engineering, 51(12):2629–2633, 2023.
- [15] V. Grande, N. Kiesler, and M. A. Francisco R. Student Perspectives on Using a Large Language Model (LLM) for an Assignment on Professional Ethics. In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1, pages 478–484. Association for Computing Machinery, 2024.
- [16] A. Jelson, D. Manesh, A. Jang, D. Dunlap, Y.-H. Kim, and S. W. Lee. An Empirical Study to Understand How Students Use ChatGPT for Writing Essays. arXiv preprint arXiv:2501.10551, 2025.
- [17] M. Kazemitabaar, X. Hou, A. Henley, B. J. Ericson, D. Weintrop, and T. Grossman. How Novices Use LLM-based Code Generators to Solve CS1 Coding Tasks in a Self-Paced Learning Environment. In Proceedings of the 23rd Koli Calling International Conference on Computing Education Research, pages 1–12, 2023.
- [18] J. Kim, S. Yu, S.-S. Lee, and R. Detrick. Students' Prompt Patterns and Its Effects in AI-Assisted Academic Writing: Focusing on Students' Level of AI Literacy. Journal of Research on Technology in Education, 58(3):638–655, 2026.
- [19] S. Lau and P. Guo. From "Ban It Till We Understand It" to "Resistance is Futile": How University Programming Instructors Plan to Adapt as More Students Use AI Code Generation and Explanation Tools such as ChatGPT and GitHub Copilot. In Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1, pages 106–121, 2023.
- [20] B. Ma, L. Chen, and S. Konomi. Examining Student-ChatGPT Interactions in Programming Education, pages 97–114. Springer Nature Switzerland, Cham, 2025.
- [21] S. MacNeil, A. Tran, A. Hellas, J. Kim, S. Sarsa, P. Denny, S. Bernstein, and J. Leinonen. Experiences from Using Code Explanations Generated by Large Language Models in a Web Software Development E-Book. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, pages 931–937, 2023.
- [22] E. Mollick and L. Mollick. Assigning AI: Seven Approaches for Students, with Prompts. arXiv preprint arXiv:2306.10052, 2023.
- [23] I. Orozco Vasquez, R. Mahinpei, N. Elouazizi, and C. Conati. An Emergent Bottom-Up Categorization of Students' LLMs Usage in an Undergraduate Research Course. In International Conference on Artificial Intelligence in Education, pages 133–139. Springer, 2025.
- [24] S. Rasnayaka, G. Wang, R. Shariffdeen, and G. N. Iyer. An Empirical Study on Usage and Perceptions of LLMs in a Software Engineering Project. In Proceedings of the 1st International Workshop on Large Language Models for Code, pages 111–118, 2024.
- [25]
- [26] G. Sawalha, I. Taj, and A. Shoufan. Analyzing Student Prompts and Their Effect on ChatGPT's Performance. Cogent Education, 11(1):2397200, 2024.
- [27] S. Schulhoff, M. Ilie, N. Balepur, K. Kahadze, A. Liu, C. Si, Y. Li, A. Gupta, H. Han, S. Schulhoff, et al. The Prompt Report: A Systematic Survey of Prompt Engineering Techniques. arXiv preprint arXiv:2406.06608, 2024.
- [28] D. Sun, A. Boudouaia, J. Yang, and J. Xu. Investigating Students' Programming Behaviors, Interaction Qualities and Perceptions through Prompt-Based Learning in ChatGPT. Humanities and Social Sciences Communications, 11(1):1–14, 2024.
- [29] A. Vadaparty, D. Zingaro, D. H. Smith IV, M. Padala, C. Alvarado, J. Gorson Benario, and L. Porter. CS1-LLM: Integrating LLMs into CS1 Instruction. In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1, pages 297–303. Association for Computing Machinery, 2024.