Access Timing as Scaffolding: A Reinforcement Learning Approach to GenAI in Education

Davinia Hernández-Leo; Janne Rotter; Pau Benazet i Montobbio

arXiv: 2605.15850 · v2 · pith:GNR3B7VDnew · submitted 2026-05-15 · 💻 cs.CY · cs.AI · cs.HC

Janne Rotter, Pau Benazet i Montobbio, Davinia Hernández-Leo

Pith reviewed 2026-05-19 22:01 UTC · model grok-4.3

Keywords: generative AI, reinforcement learning, access timing, scaffolding, metacognition, educational technology, learning outcomes, AI in education

The pith

Strategically timed GenAI access decided by a reinforcement learning agent improves post-test performance and metacognitive accuracy over unrestricted or withheld use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether the timing of access to generative AI can function as implicit scaffolding for students. It implements this idea through a reinforcement learning agent whose access decisions receive rewards drawn from metacognitive theory, cognitive load theory, and productive failure. A mixed-methods study with 105 participants compared this agent against conditions of unrestricted GenAI access and complete withholding. The timed-access condition produced higher objective post-test scores and more accurate self-assessments than unrestricted access, while also lowering error rates and time on task relative to full restriction, all without added prompts or structured guidance. These results position access timing as a practical, theory-based way to integrate off-the-shelf AI tools into learning.

Core claim

The central claim is that an RL agent controlling the timing of GenAI access, with a reward function grounded in metacognitive theory, cognitive load theory, and productive failure, produces better objective learning gains and metacognitive accuracy than unrestricted access, and fewer errors and less time on task than complete restriction. The effect is demonstrated in a controlled lab study and requires no explicit metacognitive prompts or additional scaffolding.

What carries the argument

The reinforcement learning agent that chooses moments for GenAI access according to a reward function derived from metacognitive theory, cognitive load theory, and productive failure.
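
To make the mechanism concrete, here is a minimal sketch of how an agent could learn when to grant access. It is not the authors' implementation: this page does not describe the paper's algorithm, state space, or reward terms. The tabular Q-learning method, the four-level `struggle` state, the transition rule in `step`, and all reward weights are assumptions chosen only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical action set; the paper's real action space is not given here.
ACTIONS = ("withhold", "grant")

def reward(struggle: int, action: str) -> float:
    """Toy reward. All weights are illustrative assumptions, not the paper's."""
    if action == "grant":
        # Help after sustained struggle relieves load; early help risks over-reliance.
        return 1.0 if struggle >= 2 else -1.0
    # Withholding early allows productive struggle; withholding late risks overload.
    return 0.5 if struggle < 2 else -0.5

def step(struggle: int, action: str) -> int:
    """Toy transition: struggle grows while withheld (capped at 3), resets on grant."""
    return 0 if action == "grant" else min(struggle + 1, 3)

def train(episodes=2000, horizon=8, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration on the toy environment."""
    rng = random.Random(seed)
    q = defaultdict(float)  # maps (state, action) -> estimated return
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[(s, act)])
            r = reward(s, a)
            s2 = step(s, a)
            best_next = max(q[(s2, act)] for act in ACTIONS)
            # Standard Q-learning update toward the bootstrapped target.
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
    return q

def greedy_policy(q):
    """Best action per struggle level according to the learned Q-table."""
    return {s: max(ACTIONS, key=lambda act: q[(s, act)]) for s in range(4)}
```

Under these toy dynamics, the learned greedy policy withholds access at low struggle levels and grants it once struggle has built up, which is the qualitative pattern that a reward combining productive failure and cognitive load considerations would be expected to favor.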

If this is right

Students reach higher post-test performance when GenAI is available only at selected times rather than always.
Metacognitive accuracy rises without the addition of explicit prompts or instructions on AI use.
Task errors drop and time on task shortens compared with denying all GenAI access.
The approach operates with standard off-the-shelf GenAI tools and requires no custom interfaces or training.
Timing decisions become a scalable pedagogical method that improves over both unrestricted and fully restricted baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar agents could extend to other AI tools or subject domains by swapping the reward function while keeping the timing mechanism.
Longer-term deployments might reveal whether repeated exposure to timed access produces lasting changes in how students self-regulate their tool use.
Educators could adapt the underlying timing logic manually in low-tech settings by observing student struggle points.
The method raises the possibility of hybrid systems that combine timed access with selective human oversight in classroom environments.

Load-bearing premise

The reward function based on those educational theories generates access decisions that causally improve learning and metacognition without any other scaffolding present.

What would settle it

A follow-up experiment in which the reinforcement learning condition shows no advantage in post-test performance or metacognitive accuracy over the unrestricted-access condition.

Figures

Figures reproduced from arXiv: 2605.15850 by Davinia Hernández-Leo, Janne Rotter, Pau Benazet i Montobbio.

**Figure 1.** Visualization of Experiment Design. […] the students' self-evaluation and the pre-post test difference of the Metacognitive Awareness Inventory for Artificial Intelligence (MAI-AI) scales, a questionnaire to assess metacognitive awareness when working with AI, were treated as dependent variables. Additionally, time on each task, LLM logs and free text reflection on participant's work with the system were coll…

**Figure 2.** Boxplot comparison of the objective post-test…

**Figure 3.** Boxplot comparison of the metacognitive ac…

**Figure 4.** Comparison of GenAI usage patterns across…

**Figure 5.** Comparison of sentiment towards the own con…

**Figure 6.** Visualization of the final policy in the case of…

**Figure 7.** Example screenshot of the ITS

Original abstract

In recent years, generative AI (GenAI) in educational settings has become ubiquitous in university students' daily lives, despite its potential to induce over-reliance, metacognitive disengagement, and diminished learning when used unrestrictedly. While most prior research has focused on how to pedagogically scaffold its usage, the question of when to allow off-the-shelf GenAI remains understudied and lacks pedagogically grounded empirical investigation. We treat access timing itself as a form of implicit scaffolding and operationalize it through a reinforcement learning (RL) agent that decides when students should access GenAI, with a reward function grounded in metacognitive theory, cognitive load theory, and productive failure. In a mixed-methods controlled lab study with N=105 higher education students, we compared the agent's effect on learning gains and metacognitive engagement to unrestricted and fully restricted use. Results show that strategically timed GenAI access under the reinforcement learning condition improved objective post-test performance and metacognitive accuracy compared with unrestricted access, while reducing task errors and time on task relative to complete withholding, thus outperforming both approaches without the need for explicit metacognitive prompts or structured scaffolding. However, no between-condition differences emerged on self-reported metacognitive awareness. Overall, timing of GenAI access therefore is a tractable, theoretically grounded, and scalable pedagogical strategy that improves over completely unrestricted and withheld access, compatible with off-the-shelf tools and potentially low adoption barrier. This opens up a new research area that explores how access timing can be facilitated by educators and implemented in human-AI learning system design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes treating the timing of access to off-the-shelf generative AI as a form of implicit scaffolding in education. It implements this via a reinforcement learning agent whose reward function is derived from metacognitive theory, cognitive load theory, and productive failure. In a mixed-methods lab study (N=105), the RL-timed access condition is compared to unrestricted and fully restricted GenAI use; the abstract reports improved post-test performance and metacognitive accuracy versus unrestricted access, plus reduced errors and time-on-task versus complete withholding, without explicit prompts.

Significance. If the empirical results are statistically robust and the theoretical reward components prove load-bearing, the work would establish access timing as a scalable, low-adoption-barrier alternative to explicit scaffolding for GenAI in learning systems. It would also demonstrate a concrete application of RL grounded in educational psychology rather than purely data-driven optimization.

major comments (3)

[Abstract / Results] Abstract and results section: the central empirical claims of between-condition differences in post-test performance, metacognitive accuracy, errors, and time-on-task are stated without any statistical details (p-values, effect sizes, confidence intervals, error bars, or baseline comparisons). This prevents evaluation of whether the reported improvements are reliable or practically meaningful.
[Methods / Results] Methods and results sections: no ablation or control condition isolates the contribution of the specific theory-derived reward terms (metacognition, cognitive load, productive failure) versus any adaptive access policy based on observable proxies such as error rate or dwell time. Without this, the causal link between the cited theories and the observed learning gains remains untested, as noted in the skeptic concern.
[Experimental Design] Experimental design: the abstract mentions N=105 but provides no exclusion criteria, randomization details, or power analysis. These omissions are load-bearing for interpreting the mixed-methods controlled lab study and the claim that timing alone suffices without explicit scaffolding.
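
The second major comment asks for a control that adapts access using observable proxies alone. A minimal sketch of such a baseline follows. The `Observation` fields, the function name `proxy_policy`, and both thresholds are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """Hypothetical observable proxies for one learner at one decision point."""
    recent_error_rate: float  # fraction of recent attempts that were wrong, 0..1
    dwell_seconds: float      # time spent on the current task so far

def proxy_policy(obs: Observation,
                 error_threshold: float = 0.5,
                 dwell_threshold: float = 120.0) -> bool:
    """Return True to grant GenAI access.

    Theory-free rule: grant once either proxy crosses its threshold.
    Both thresholds are arbitrary placeholders, not values from the study.
    """
    return (obs.recent_error_rate >= error_threshold
            or obs.dwell_seconds >= dwell_threshold)
```

Comparing the RL agent against a rule like this would indicate whether the theory-derived reward adds anything beyond simple adaptivity.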

minor comments (2)

[Methods] Notation for the RL state and action spaces could be clarified with a diagram or explicit equations to aid reproducibility.
[Results] The manuscript should include a table summarizing all between-condition comparisons with means, SDs, and test statistics rather than narrative description alone.
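
As a sketch of what the requested summary could contain, the helpers below compute per-condition descriptive statistics and Cohen's d with a pooled standard deviation for two independent groups. The function names are illustrative. No data from the study is used or implied.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d for two independent groups, using the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

def summary_row(label, scores):
    """One row of a between-condition summary table: n, mean, and sample SD."""
    return {"condition": label, "n": len(scores),
            "mean": mean(scores), "sd": stdev(scores)}
```

A full table would pair each row with the corresponding test statistic and confidence interval, which are omitted here because they depend on the analysis the authors choose.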

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas for improvement in the presentation and completeness of our manuscript. We address each major comment in turn below.

Referee: [Abstract / Results] Abstract and results section: the central empirical claims of between-condition differences in post-test performance, metacognitive accuracy, errors, and time-on-task are stated without any statistical details (p-values, effect sizes, confidence intervals, error bars, or baseline comparisons). This prevents evaluation of whether the reported improvements are reliable or practically meaningful.

Authors: We fully agree with this observation. The current version of the manuscript presents the results in a summary form without the supporting statistical information. In the revised version, we will include all relevant statistical details, including p-values, effect sizes (e.g., Cohen's d), confidence intervals, and error bars in the results section and update the abstract accordingly to reflect these. This will allow readers to assess the reliability and practical meaningfulness of the findings.

Revision: yes

Referee: [Methods / Results] Methods and results sections: no ablation or control condition isolates the contribution of the specific theory-derived reward terms (metacognition, cognitive load, productive failure) versus any adaptive access policy based on observable proxies such as error rate or dwell time. Without this, the causal link between the cited theories and the observed learning gains remains untested, as noted in the skeptic concern.

Authors: This is a valid concern regarding the causal attribution to the specific theoretical components. Our current study compares the complete RL agent (with the theory-derived reward) against non-adaptive baselines. To directly isolate each component would indeed require additional ablation conditions. We will revise the manuscript to include a more detailed explanation of how each reward term is operationalized and its theoretical justification. Additionally, we will add a limitations section acknowledging that while the overall policy shows benefits, future studies could perform ablations to test the individual contributions of the metacognition, cognitive load, and productive failure terms. We believe the current design still provides evidence for the value of theory-grounded timing over the extremes of unrestricted and restricted access.

Revision: partial

Referee: [Experimental Design] Experimental design: the abstract mentions N=105 but provides no exclusion criteria, randomization details, or power analysis. These omissions are load-bearing for interpreting the mixed-methods controlled lab study and the claim that timing alone suffices without explicit scaffolding.

Authors: We note that the full methods section of the manuscript does describe the study procedure, but we accept that key details such as exclusion criteria, randomization, and power analysis should be more prominently featured, especially given the abstract's mention of N=105. In the revision, we will add these details explicitly in the methods section and ensure the abstract or a summary highlights them. A post-hoc power analysis will be included to support the sample size.

Revision: yes
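
For readers unfamiliar with the power analysis discussed above, the following sketch shows a rough calculation for a two-sided comparison of two independent groups. It uses a normal approximation, which slightly understates the sample size a t-test would need. The function names and default values are assumptions, and the sketch does not reproduce whatever analysis the authors will report.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power to detect effect size d with n_per_group per condition.

    Ignores the negligible contribution of the opposite rejection tail.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(abs(d) * sqrt(n_per_group / 2) - z_crit)

def n_per_group_for(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size needed to reach the target power."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / abs(d)) ** 2)
```

For a conventional medium effect (d = 0.5) at 80% power and alpha = 0.05, this approximation gives 63 participants per group for a single pairwise comparison. A three-condition design with multiple comparisons would call for a stricter analysis.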

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

The paper states that the reward function is grounded in external metacognitive theory, cognitive load theory, and productive failure rather than being fitted to the study's outcome measures. The RL agent is then deployed in a controlled lab experiment (N=105) whose post-test scores, metacognitive accuracy, error rates, and time-on-task are independent measured variables. No equations, self-citations, or parameter-fitting steps are described that would make the reported performance gains equivalent to the input reward definition by construction. The central claim therefore rests on an external theoretical grounding plus an empirical test, not on a closed loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the validity of the theory-derived reward function and on the assumption that the lab-study outcomes reflect genuine causal effects of access timing rather than artifacts of the specific participant pool or task.

axioms (1)

Domain assumption: the reward function grounded in metacognitive theory, cognitive load theory, and productive failure accurately operationalizes beneficial access timing.
Invoked to justify the RL agent's objective without explicit metacognitive prompts.

pith-pipeline@v0.9.0 · 5820 in / 1238 out tokens · 37056 ms · 2026-05-19T22:01:08.758560+00:00 · methodology
