Reimagining Assessment in the Age of Generative AI: Lessons from Open-Book Exams with ChatGPT
Pith reviewed 2026-05-13 02:44 UTC · model grok-4.3
The pith
Generative AI shifts exam assessment from producing solutions to judging their validity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The presence of generative AI shifted the cognitive task of assessment from producing solutions to assessing solution validity. Correctness of final answers alone may no longer provide sufficient evidence of comprehension. Instead, competencies such as prompt formulation, verification, and judgment become visible indicators of learning. Transparent integration of AI appeared to reduce focus on rule avoidance and promote self-regulation.
What carries the argument
The three observed patterns of AI use (answer retrieval, guided collaboration, critical verification) become visible through the required interaction transcripts and thereby expose students' evaluative reasoning.
If this is right
- Final-answer correctness alone stops being sufficient evidence of comprehension.
- Prompt formulation, output verification, and judgment become observable indicators of learning.
- Transparent AI use reduces emphasis on rule avoidance and supports self-regulated reasoning.
- Future assessments should target reasoning about solutions rather than independent production of them.
Where Pith is reading between the lines
- Courses could redesign exams to require submission of AI transcripts as standard practice so that judgment processes are graded directly.
- The same transcript method might reveal parallel shifts in reasoning in non-engineering fields such as writing or data analysis.
- Professional workplaces already reward the ability to critique AI output, so this form of assessment may better align with later career demands.
Load-bearing premise
The three patterns of AI use are stable and generalizable across students and courses, and requiring transcript submission does not itself alter student behavior or introduce selection bias.
What would settle it
A follow-up study with a different student population or subject would undermine the claim that assessment has shifted to validity evaluation if it found only answer-retrieval behavior with no evidence of iterative verification, or if students reported changing their prompts because they knew the transcripts would be read.
read the original abstract
Generative AI systems such as ChatGPT challenge traditional assumptions about academic assessment by enabling students to generate explanations, code, and solutions in real time. Rather than attempting to restrict AI use, this study investigates how students actually interact with such systems during formal evaluation. Engineering students were permitted to use ChatGPT during take-home open-book exams and were required to submit interaction transcripts alongside exam solutions. This provided direct observational evidence of reasoning processes rather than relying on self-reported behavior. Qualitative analysis revealed three progressive patterns of use: answer retrieval, guided collaboration, and critical verification. While some students initially copied questions verbatim and received generic responses, many refined prompts iteratively and tested outputs. Some of the strongest evidence of reasoning appeared when students evaluated incorrect or incomplete AI responses, revealing evaluative reasoning through debugging, comparison, and justification. The presence of generative AI shifted the cognitive task of assessment from producing solutions to assessing solution validity. The findings suggest that, in AI-mediated assessment environments, correctness of final answers alone may no longer provide sufficient evidence of comprehension. Instead, competencies such as prompt formulation, verification, and judgment become visible indicators of learning. Transparent integration of AI appeared to reduce focus on rule avoidance and promote self-regulation. Assessments should evolve to evaluate reasoning about solutions rather than independent solution production. Generative AI therefore does not invalidate assessment but has the potential to expose deeper forms of understanding aligned with professional practice.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines how engineering students interact with ChatGPT during take-home open-book exams where AI use is permitted and interaction transcripts must be submitted with solutions. Qualitative analysis of the transcripts identifies three progressive patterns of use—answer retrieval, guided collaboration, and critical verification—and concludes that generative AI shifts the cognitive demands of assessment from producing solutions to evaluating their validity. Consequently, competencies such as prompt formulation, verification, and judgment become visible indicators of learning, and the authors recommend redesigning assessments to focus on reasoning about solutions rather than independent production.
Significance. If the observed patterns prove robust and generalizable, the work supplies direct observational data on AI-mediated assessment that could usefully inform evolving educational practices, moving beyond prohibition or detection toward integration that surfaces higher-order skills. The transcript-submission method offers a concrete alternative to self-report surveys, providing a model for future studies of human-AI collaboration in authentic settings.
major comments (2)
- [Methods] Methods section: the manuscript supplies no sample size, number of transcripts analyzed, coding scheme, inter-rater reliability metrics, or exclusion criteria for the qualitative analysis. Without these details the leap from raw transcripts to the three stable patterns and the claim that AI 'shifted the cognitive task' cannot be evaluated for analytic rigor or replicability.
- [Study Design and Discussion] Study design and Discussion: mandatory transcript submission creates a plausible Hawthorne effect and selection bias, since students knew their reasoning would be reviewed. This design choice undermines the central inference that the patterns (especially critical verification) reflect unaltered AI use rather than the observability requirement itself; a no-transcript control arm or participation-rate checks are needed to support the claim that 'correctness of final answers alone may no longer provide sufficient evidence of comprehension.'
minor comments (2)
- [Abstract and Results] Abstract and Results: concrete transcript excerpts or tabulated counts of each pattern would make the progression from retrieval to verification more transparent and allow readers to assess the strength of the qualitative evidence.
- [Discussion] The manuscript would benefit from explicit discussion of how the take-home format and open-book policy interact with AI use, as these contextual factors may limit generalizability to in-class or closed-book settings.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which highlights important areas for improving the transparency and interpretive caution of our work. We respond to each major comment below and indicate the revisions we will undertake.
read point-by-point responses
-
Referee: [Methods] Methods section: the manuscript supplies no sample size, number of transcripts analyzed, coding scheme, inter-rater reliability metrics, or exclusion criteria for the qualitative analysis. Without these details the leap from raw transcripts to the three stable patterns and the claim that AI 'shifted the cognitive task' cannot be evaluated for analytic rigor or replicability.
Authors: We agree that the current Methods section lacks sufficient detail on the qualitative analysis procedures. In the revised manuscript we will add the total number of transcripts analyzed, a description of the inductive coding process that led to the three patterns, inter-rater reliability statistics, and explicit exclusion criteria. These additions will allow readers to evaluate the analytic rigor directly. The three patterns were identified through iterative, data-driven coding of the full set of submitted transcripts; the interpretation that AI use shifted the cognitive task from production to verification is an inference drawn from instances in which students explicitly critiqued, debugged, or justified AI outputs. revision: yes
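The inter-rater reliability statistics promised here are typically reported as Cohen's kappa, which corrects raw two-coder agreement for agreement expected by chance. A minimal sketch follows; the coder labels and transcript counts are invented for illustration and do not come from the study.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two coders over the same items."""
    assert len(codes_a) == len(codes_b)
    n = len(codes_a)
    # Observed agreement: fraction of transcripts coded identically.
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Expected chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    labels = set(codes_a) | set(codes_b)
    p_e = sum((freq_a[lab] / n) * (freq_b[lab] / n) for lab in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes for ten transcripts: R = answer retrieval,
# G = guided collaboration, V = critical verification.
coder_1 = ["R", "R", "G", "G", "V", "V", "R", "G", "V", "G"]
coder_2 = ["R", "R", "G", "V", "V", "V", "R", "G", "V", "G"]
print(round(cohens_kappa(coder_1, coder_2), 3))
```

Values above roughly 0.8 are conventionally read as strong agreement; reporting kappa alongside the coding scheme would let readers judge whether the three patterns are stable across coders.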
-
Referee: [Study Design and Discussion] Study design and Discussion: mandatory transcript submission creates a plausible Hawthorne effect and selection bias, since students knew their reasoning would be reviewed. This design choice undermines the central inference that the patterns (especially critical verification) reflect unaltered AI use rather than the observability requirement itself; a no-transcript control arm or participation-rate checks are needed to support the claim that 'correctness of final answers alone may no longer provide sufficient evidence of comprehension.'
Authors: We recognize that requiring transcript submission introduces a plausible Hawthorne effect and possible selection bias, as students were aware their interactions would be examined. This is a genuine limitation of the study design. In the revised Discussion we will expand the limitations subsection to address this issue explicitly, report available participation-rate information, and qualify the central claim by noting that the observed patterns (including critical verification) occurred under conditions of observability. A no-transcript control arm was not feasible within the authentic take-home exam setting in which the study was conducted; however, we maintain that the transparent design still yields valuable observational data that self-report methods cannot provide, and we will temper the language around generalizability accordingly. revision: partial
Circularity Check
No circularity: purely observational qualitative study with no derivations or self-referential modeling
full rationale
The paper conducts a qualitative analysis of student-AI interaction transcripts from open-book exams. All claims (shift from solution production to validity assessment, visibility of prompt formulation/verification/judgment) are presented as direct inferences from observed patterns in the submitted transcripts. There are no equations, no fitted parameters, no predictions derived from subsets of the data, and no load-bearing self-citations or uniqueness theorems. The derivation chain consists entirely of empirical observation and interpretation; it does not reduce any result to its own inputs by construction. This is the most common honest finding for non-mathematical empirical papers.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: qualitative thematic analysis of chat transcripts can reliably surface stable patterns of student reasoning.
Reference graph
Works this paper leans on
- [1] E. Kasneci, K. Sessler, M. Küchemann, et al., "ChatGPT for good? On opportunities and challenges of large language models for education," Learning and Individual Differences, vol. 103, 102274, 2023. https://doi.org/10.1016/j.lindif.2023.102274
- [2] O. Zawacki-Richter, V. I. Marín, M. Bond, and F. Gouverneur, "Systematic review of research on artificial intelligence applications in higher education," International Journal of Educational Technology in Higher Education, vol. 16, no. 39, 2019. https://doi.org/10.1186/s41239-019-0171-0
- [3] W. Holmes, M. Bialik, and C. Fadel, Artificial Intelligence in Education: Promises and Implications for Teaching and Learning. Boston, MA: Center for Curriculum Redesign, 2019. [Online]. Available: https://curriculumredesign.org/wp-content/uploads/AIED-Book-Excerpt-CCR.pdf
- [4] D. Cotton, P. Cotton, and J. Shipway, "Chatting and cheating: Ensuring academic integrity in the era of ChatGPT," Innovations in Education and Teaching International, vol. 61, no. 2, pp. 228–239, 2024. https://doi.org/10.1080/14703297.2023.2190148
- [5] T. Susnjak, "ChatGPT: The end of online exam integrity?," arXiv:2212.09292, Dec. 2022. [Online]. Available: https://arxiv.org/abs/2212.09292
- [6] J. Rudolph, S. Tan, and S. Tan, "ChatGPT: Bullshit spewer or the end of traditional assessments in higher education?," Journal of Applied Learning & Teaching, vol. 6, no. 1, 2023. https://doi.org/10.37074/jalt.2023.6.1.9
- [7] E. Mollick, "Assigning AI: Seven approaches for students, with prompts," Wharton Interactive Working Paper, University of Pennsylvania, 2023. [Online]. Available: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4475995
- [8] I. M. García-López, C. S. González-González, M. Ramírez-Montoya, and J. Molina-Espinosa, "Challenges of implementing ChatGPT on education: Systematic literature review," International Journal of Educational Research Open, vol. 8, 100401, 2025. https://doi.org/10.1016/j.ijedro.2024.100401
- [9] D. Boud and N. Falchikov, "Aligning assessment with long-term learning," Assessment & Evaluation in Higher Education, vol. 31, no. 4, pp. 399–413, 2006. https://doi.org/10.1080/02602930600679050
- [10] D. Nicol, "E-assessment by design: Using multiple-choice tests to promote student learning," British Journal of Educational Technology, vol. 38, no. 1, pp. 53–63, 2007. https://doi.org/10.1080/03098770601167922
- [11] X. Weng, Q. Xia, M. Gu, K. Rajaram, and T. K. F. Chiu, "Assessment and learning outcomes for generative AI in higher education: A scoping review on current research status and trends," Australasian Journal of Educational Technology, vol. 40, no. 6, pp. 37–55, 2024. https://doi.org/10.14742/ajet.9540
- [12] M. Bearman, J. Tai, P. Dawson, D. Boud, and R. Ajjawi, "Developing evaluative judgement for a time of generative artificial intelligence," Assessment & Evaluation in Higher Education, vol. 49, no. 6, pp. 893–905, 2024. https://doi.org/10.1080/02602938.2024.2335321
- [13] N. A. Dahri, N. Yahaya, W. M. Al-Rahmi, A. Aldraiweesh, U. Alturki, S. Almutairy, A. Shutaleva, and R. B. Soomro, "Extended TAM based acceptance of AI-powered ChatGPT for supporting metacognitive self-regulated learning in education: A mixed-methods study," Heliyon, vol. 10, no. 8, e29317, 2024. https://doi.org/10.1016/j.heliyon.2024.e29317
- [14] M. Belkina, S. Daniel, S. Nikolic, R. Haque, S. Lyden, P. Neal, S. Grundy, and G. M. Hassan, "Implementing generative AI (GenAI) in higher education: A systematic review of case studies," Computers and Education: Artificial Intelligence, vol. 8, 100407, 2025. https://doi.org/10.1016/j.caeai.2025.100407
- [15] V. Braun and V. Clarke, "Reflecting on reflexive thematic analysis," Qualitative Research in Sport, Exercise and Health, vol. 11, no. 4, pp. 589–597, 2019. https://doi.org/10.1080/2159676X.2019.1628806