pith. machine review for the scientific record.

arxiv: 2605.12363 · v1 · submitted 2026-05-12 · 💻 cs.CY · cs.HC

Recognition: no theorem link

Reimagining Assessment in the Age of Generative AI: Lessons from Open-Book Exams with ChatGPT

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:44 UTC · model grok-4.3

classification 💻 cs.CY · cs.HC
keywords generative AI · academic assessment · ChatGPT · open-book exams · student reasoning · AI-assisted learning · educational technology

The pith

Generative AI shifts exam assessment from producing solutions to judging their validity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The study let engineering students use ChatGPT during take-home open-book exams and required them to submit their interaction transcripts with their answers. Qualitative review of those transcripts showed students moving through three patterns of engagement: simply retrieving answers, collaborating with the tool to refine outputs, and critically checking and fixing AI responses. This made visible forms of reasoning, such as prompt iteration and evaluation of incorrect results, that traditional answer-checking would miss. The core finding is that when AI is available, final-answer correctness alone stops being enough proof of learning, and skills like verification and judgment become the measurable competencies instead.

Core claim

The presence of generative AI shifted the cognitive task of assessment from producing solutions to assessing solution validity. Correctness of final answers alone may no longer provide sufficient evidence of comprehension. Instead, competencies such as prompt formulation, verification, and judgment become visible indicators of learning. Transparent integration of AI appeared to reduce focus on rule avoidance and promote self-regulation.

What carries the argument

The three observed patterns of AI use (answer retrieval, guided collaboration, critical verification) that become visible through required interaction transcripts and thereby expose evaluative reasoning.
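
A purely illustrative sketch of how such transcript coding might be tallied, in the spirit of the tabulated counts the referee requests below. The paper does not publish its coding scheme, so the pattern labels, the toy data, and the "most advanced pattern reached" classification rule are all assumptions made for exposition:

    from collections import Counter

    # Order reflects the progression reported in the paper, from least to
    # most sophisticated engagement with the AI.
    PATTERNS = ["answer_retrieval", "guided_collaboration", "critical_verification"]

    # Hypothetical output of human coding: each transcript is reduced to the
    # list of patterns a coder observed in it (a transcript can show several).
    coded_transcripts = {
        "student_01": ["answer_retrieval"],
        "student_02": ["answer_retrieval", "guided_collaboration"],
        "student_03": ["guided_collaboration", "critical_verification"],
    }

    def dominant_pattern(patterns: list[str]) -> str:
        """Classify a transcript by the most advanced pattern it reaches."""
        return max(patterns, key=PATTERNS.index)

    tally = Counter(dominant_pattern(p) for p in coded_transcripts.values())
    for pattern in PATTERNS:
        print(f"{pattern}: {tally[pattern]} transcript(s)")

On the toy data this prints one transcript per pattern; the point is only that a transcript-level classification rule, once stated, makes the qualitative progression countable and auditable.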

If this is right

  • Final-answer correctness alone stops being sufficient evidence of comprehension.
  • Prompt formulation, output verification, and judgment become observable indicators of learning.
  • Transparent AI use reduces emphasis on rule avoidance and supports self-regulated reasoning.
  • Future assessments should target reasoning about solutions rather than independent production of them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Courses could redesign exams to require submission of AI transcripts as standard practice so that judgment processes are graded directly.
  • The same transcript method might reveal parallel shifts in reasoning in non-engineering fields such as writing or data analysis.
  • Professional workplaces already reward the ability to critique AI output, so this form of assessment may better align with later career demands.

Load-bearing premise

The three patterns of AI use are stable and generalizable across students and courses, and requiring transcript submission does not itself alter student behavior or introduce selection bias.

What would settle it

The claim that assessment has shifted toward validity evaluation would be undermined by a follow-up study in a different student population or subject that finds only answer-retrieval behavior and no evidence of iterative verification, or one in which students report changing their prompts because they knew the transcripts would be read.

read the original abstract

Generative AI systems such as ChatGPT challenge traditional assumptions about academic assessment by enabling students to generate explanations, code, and solutions in real time. Rather than attempting to restrict AI use, this study investigates how students actually interact with such systems during formal evaluation. Engineering students were permitted to use ChatGPT during take-home open-book exams and were required to submit interaction transcripts alongside exam solutions. This provided direct observational evidence of reasoning processes rather than relying on self-reported behavior. Qualitative analysis revealed three progressive patterns of use: answer retrieval, guided collaboration, and critical verification. While some students initially copied questions verbatim and received generic responses, many refined prompts iteratively and tested outputs. Some of the strongest evidence of reasoning appeared when students evaluated incorrect or incomplete AI responses, revealing evaluative reasoning through debugging, comparison, and justification. The presence of generative AI shifted the cognitive task of assessment from producing solutions to assessing solution validity. The findings suggest that, in AI-mediated assessment environments, correctness of final answers alone may no longer provide sufficient evidence of comprehension. Instead, competencies such as prompt formulation, verification, and judgment become visible indicators of learning. Transparent integration of AI appeared to reduce focus on rule avoidance and promote self-regulation. Assessments should evolve to evaluate reasoning about solutions rather than independent solution production. Generative AI therefore does not invalidate assessment but has the potential to expose deeper forms of understanding aligned with professional practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines how engineering students interact with ChatGPT during take-home open-book exams where AI use is permitted and interaction transcripts must be submitted with solutions. Qualitative analysis of the transcripts identifies three progressive patterns of use—answer retrieval, guided collaboration, and critical verification—and concludes that generative AI shifts the cognitive demands of assessment from producing solutions to evaluating their validity. Consequently, competencies such as prompt formulation, verification, and judgment become visible indicators of learning, and the authors recommend redesigning assessments to focus on reasoning about solutions rather than independent production.

Significance. If the observed patterns prove robust and generalizable, the work supplies direct observational data on AI-mediated assessment that could usefully inform evolving educational practices, moving beyond prohibition or detection toward integration that surfaces higher-order skills. The transcript-submission method offers a concrete alternative to self-report surveys, providing a model for future studies of human-AI collaboration in authentic settings.

major comments (2)
  1. [Methods] The manuscript supplies no sample size, number of transcripts analyzed, coding scheme, inter-rater reliability metrics, or exclusion criteria for the qualitative analysis. Without these details the leap from raw transcripts to the three stable patterns and the claim that AI 'shifted the cognitive task' cannot be evaluated for analytic rigor or replicability.
  2. [Study Design and Discussion] Mandatory transcript submission creates a plausible Hawthorne effect and selection bias, since students knew their reasoning would be reviewed. This design choice undermines the central inference that the patterns (especially critical verification) reflect unaltered AI use rather than the observability requirement itself; a no-transcript control arm or participation-rate checks are needed to support the claim that 'correctness of final answers alone may no longer provide sufficient evidence of comprehension.'
minor comments (2)
  1. [Abstract and Results] Concrete transcript excerpts or tabulated counts of each pattern would make the progression from retrieval to verification more transparent and allow readers to assess the strength of the qualitative evidence.
  2. [Discussion] The manuscript would benefit from explicit discussion of how the take-home format and open-book policy interact with AI use, as these contextual factors may limit generalizability to in-class or closed-book settings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for improving the transparency and interpretive caution of our work. We respond to each major comment below and indicate the revisions we will undertake.

read point-by-point responses
  1. Referee: [Methods] The manuscript supplies no sample size, number of transcripts analyzed, coding scheme, inter-rater reliability metrics, or exclusion criteria for the qualitative analysis. Without these details the leap from raw transcripts to the three stable patterns and the claim that AI 'shifted the cognitive task' cannot be evaluated for analytic rigor or replicability.

    Authors: We agree that the current Methods section lacks sufficient detail on the qualitative analysis procedures. In the revised manuscript we will add the total number of transcripts analyzed, a description of the inductive coding process that led to the three patterns, inter-rater reliability statistics, and explicit exclusion criteria. These additions will allow readers to evaluate the analytic rigor directly. The three patterns were identified through iterative, data-driven coding of the full set of submitted transcripts; the interpretation that AI use shifted the cognitive task from production to verification is an inference drawn from instances in which students explicitly critiqued, debugged, or justified AI outputs. revision: yes

  2. Referee: [Study Design and Discussion] Mandatory transcript submission creates a plausible Hawthorne effect and selection bias, since students knew their reasoning would be reviewed. This design choice undermines the central inference that the patterns (especially critical verification) reflect unaltered AI use rather than the observability requirement itself; a no-transcript control arm or participation-rate checks are needed to support the claim that 'correctness of final answers alone may no longer provide sufficient evidence of comprehension.'

    Authors: We recognize that requiring transcript submission introduces a plausible Hawthorne effect and possible selection bias, as students were aware their interactions would be examined. This is a genuine limitation of the study design. In the revised Discussion we will expand the limitations subsection to address this issue explicitly, report available participation-rate information, and qualify the central claim by noting that the observed patterns (including critical verification) occurred under conditions of observability. A no-transcript control arm was not feasible within the authentic take-home exam setting in which the study was conducted; however, we maintain that the transparent design still yields valuable observational data that self-report methods cannot provide, and we will temper the language around generalizability accordingly. revision: partial
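
Both the referee's first major comment and the authors' response turn on inter-rater reliability. As a minimal sketch of the kind of statistic the authors say they will report, here is Cohen's kappa computed from scratch over hypothetical labels from two coders; the study's actual codes and transcript data are not public, so everything below the function is invented for illustration:

    from collections import Counter

    def cohen_kappa(coder_a: list[str], coder_b: list[str]) -> float:
        """Cohen's kappa: chance-corrected agreement between two coders.

        kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
        rate and p_e is the agreement expected if both coders assigned labels
        at random according to their own marginal label frequencies.
        """
        assert len(coder_a) == len(coder_b) and coder_a
        n = len(coder_a)
        p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
        freq_a, freq_b = Counter(coder_a), Counter(coder_b)
        labels = set(coder_a) | set(coder_b)
        p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical pattern labels for eight transcripts, one label per coder.
    coder_a = ["retrieval", "retrieval", "collab", "verify",
               "collab", "verify", "retrieval", "collab"]
    coder_b = ["retrieval", "collab", "collab", "verify",
               "collab", "verify", "retrieval", "verify"]

    print(f"kappa = {cohen_kappa(coder_a, coder_b):.2f}")  # kappa = 0.63

Values above roughly 0.6 are conventionally read as substantial agreement; scikit-learn's sklearn.metrics.cohen_kappa_score computes the same quantity if a library implementation is preferred.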

Circularity Check

0 steps flagged

No circularity: purely observational qualitative study with no derivations or self-referential modeling

full rationale

The paper conducts a qualitative analysis of student-AI interaction transcripts from open-book exams. All claims (shift from solution production to validity assessment, visibility of prompt formulation/verification/judgment) are presented as direct inferences from observed patterns in the submitted transcripts. There are no equations, no fitted parameters, no predictions derived from subsets of the data, and no load-bearing self-citations or uniqueness theorems. The derivation chain consists entirely of empirical observation and interpretation; it does not reduce any result to its own inputs by construction. This is the most common honest finding for non-mathematical empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study is purely empirical and qualitative. It introduces no free parameters, no new physical or mathematical entities, and relies only on standard assumptions of qualitative thematic analysis.

axioms (1)
  • domain assumption Qualitative thematic analysis of chat transcripts can reliably surface stable patterns of student reasoning.
    The three progressive patterns are derived from this analytic step.

pith-pipeline@v0.9.0 · 5550 in / 1295 out tokens · 63600 ms · 2026-05-13T02:44:38.165208+00:00 · methodology


Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

  1. [1]

    ChatGPT for good? On opportunities and challenges of large language models for education

    E. Kasneci, K. Sessler, M. Küchemann, et al., “ChatGPT for good? On opportunities and challenges of large language models for education,” Learning and Individual Differences, vol. 103, 102274, 2023. https://doi.org/10.1016/j.lindif.2023.102274

  2. [2]

    Systematic review of research on artificial intelligence applications in higher education

    O. Zawacki-Richter, V. I. Marín, M. Bond, and F. Gouverneur, “Systematic review of research on artificial intelligence applications in higher education,” International Journal of Educational Technology in Higher Education, vol. 16, no. 39, 2019. https://doi.org/10.1186/s41239-019-0171-0

  3. [3]

    Artificial Intelligence in Education: Promises and Implications for Teaching and Learning

    W. Holmes, M. Bialik, and C. Fadel, Artificial Intelligence in Education: Promises and Implications for Teaching and Learning. Boston, MA: Center for Curriculum Redesign, 2019. [Online]. Available: https://curriculumredesign.org/wp-content/uploads/AIED-Book-Excerpt-CCR.pdf

  4. [4]

    Chatting and cheating: Ensuring academic integrity in the era of ChatGPT

    D. Cotton, P. Cotton, and J. Shipway, “Chatting and cheating: Ensuring academic integrity in the era of ChatGPT,” Innovations in Education and Teaching International, vol. 61, no. 2, pp. 228–239, 2024. https://doi.org/10.1080/14703297.2023.2190148

  5. [5]

    ChatGPT: The end of online exam integrity?

    T. Susnjak, “ChatGPT: The end of online exam integrity?,” arXiv:2212.09292, Dec. 2022. [Online]. Available: https://arxiv.org/abs/2212.09292

  6. [6]

    ChatGPT: Bullshit spewer or the end of traditional assessments in higher education?

    J. Rudolph, S. Tan, and S. Tan, “ChatGPT: Bullshit spewer or the end of traditional assessments in higher education?,” Journal of Applied Learning & Teaching, vol. 6, no. 1, 2023. https://doi.org/10.37074/jalt.2023.6.1.9

  7. [7]

    Assigning AI: Seven approaches for students, with prompts

    E. Mollick, “Assigning AI: Seven approaches for students, with prompts,” Wharton Interactive Working Paper, University of Pennsylvania, 2023. [Online]. Available: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4475995

  8. [8]

    Challenges of implementing ChatGPT on education: Systematic literature review

    I. M. García-López, C. S. González-González, M. Ramírez-Montoya, and J. Molina-Espinosa, “Challenges of implementing ChatGPT on education: Systematic literature review,” International Journal of Educational Research Open, vol. 8, 100401, 2025. https://doi.org/10.1016/j.ijedro.2024.100401

  9. [9]

    Aligning assessment with long-term learning

    D. Boud and N. Falchikov, “Aligning assessment with long-term learning,” Assessment & Evaluation in Higher Education, vol. 31, no. 4, pp. 399–413, 2006. https://doi.org/10.1080/02602930600679050

  10. [10]

    E-assessment by design: Using multiple-choice tests to promote student learning

    D. Nicol, “E-assessment by design: Using multiple-choice tests to promote student learning,” British Journal of Educational Technology, vol. 38, no. 1, pp. 53–63, 2007. https://doi.org/10.1080/03098770601167922

  11. [11]

    Assessment and learning outcomes for generative AI in higher education: A scoping review on current research status and trends

    X. Weng, Q. Xia, M. Gu, K. Rajaram, and T. K. F. Chiu, “Assessment and learning outcomes for generative AI in higher education: A scoping review on current research status and trends,” Australasian Journal of Educational Technology, vol. 40, no. 6, pp. 37–55, 2024. https://doi.org/10.14742/ajet.9540

  12. [12]

    Developing evaluative judgement for a time of generative artificial intelligence

    M. Bearman, J. Tai, P. Dawson, D. Boud, and R. Ajjawi, “Developing evaluative judgement for a time of generative artificial intelligence,” Assessment & Evaluation in Higher Education, vol. 49, no. 6, pp. 893–905, 2024. https://doi.org/10.1080/02602938.2024.2335321

  13. [13]

    Extended TAM based acceptance of AI-Powered ChatGPT for supporting metacognitive self-regulated learning in education: A mixed-methods study

    N. A. Dahri, N. Yahaya, W. M. Al-Rahmi, A. Aldraiweesh, U. Alturki, S. Almutairy, A. Shutaleva, and R. B. Soomro, “Extended TAM based acceptance of AI-Powered ChatGPT for supporting metacognitive self-regulated learning in education: A mixed-methods study,” Heliyon, vol. 10, no. 8, e29317, 2024. https://doi.org/10.1016/j.heliyon.2024.e29317

  14. [14]

    Implementing generative AI (GenAI) in higher education: A systematic review of case studies

    M. Belkina, S. Daniel, S. Nikolic, R. Haque, S. Lyden, P. Neal, S. Grundy, and G. M. Hassan, “Implementing generative AI (GenAI) in higher education: A systematic review of case studies,” Computers and Education: Artificial Intelligence, vol. 8, 100407, 2025. https://doi.org/10.1016/j.caeai.2025.100407

  15. [15]

    Reflecting on reflexive thematic analysis

    V. Braun and V. Clarke, “Reflecting on reflexive thematic analysis,” Qualitative Research in Sport, Exercise and Health, vol. 11, no. 4, pp. 589–597, 2019. https://doi.org/10.1080/2159676X.2019.1628806