Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

Aske Plaat; Michiel T. van der Meer; Nicholas Hogan; Po-Chin Chang

arXiv: 2606.20138 · v1 · pith:XJOM2KTPnew · submitted 2026-06-18 · 💻 cs.AI · cs.CL · cs.HC · cs.LG

Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer

Pith reviewed 2026-06-26 17:45 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.HC · cs.LG

keywords LLM tutoring · adaptive prompting · prompt routing · student engagement · high-school education · simulation-to-real · pedagogical strategies · exercise conversion

The pith

An adaptive LLM tutoring system using a stochastic prompt router raises exercise conversion to 28.1 percent and shortens sessions by three turns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops and tests an adaptive tutoring system that extracts 14 pedagogical features from transcripts to select subject-aware prompts via a router model. The router is first trained inside a simulation environment and then deployed live with high-school students. In simulation the router beats static baselines, and in A/B testing the adaptive system transfers the learned behavior, switching strategies as needed. The stochastic version of the router produces a higher rate of students completing exercises while the overall adaptive mechanism reduces the number of conversation turns required.

Core claim

A prompt routing model trained in simulation and deployed adaptively achieves sim-to-real transfer by switching from analytical to scaffolding strategies; the adaptive selection improves instructional efficiency by reducing interactions by around 3 turns while a stochastic router raises exercise conversion rate to 28.1 percent compared with 19.6 percent for the baseline.

What carries the argument

The stochastic router that samples from pedagogical strategies informed by 14 features extracted from raw transcripts.
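The difference between the greedy and the stochastic router can be sketched as follows. This is a minimal illustration, not the paper's implementation: the softmax sampling rule, the `temperature` parameter, the function names, and the example scores are all assumptions made here. The source states only that one router is greedy and that the other samples strategies.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Turn raw strategy scores into sampling probabilities."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_route(scores):
    """Always pick the index of the highest-scoring strategy."""
    return max(range(len(scores)), key=lambda i: scores[i])


def stochastic_route(scores, rng, temperature=1.0):
    """Sample a strategy index in proportion to its softmax probability."""
    probs = softmax(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


# Hypothetical strategy labels and router scores, for illustration only.
strategies = ["analytical", "scaffolding", "worked-example"]
scores = [0.2, 0.5, 0.1]

rng = random.Random(0)
print("greedy:", strategies[greedy_route(scores)])
print("sampled:", [strategies[stochastic_route(scores, rng)] for _ in range(5)])
```

Under these assumptions the greedy router returns the same strategy whenever the scores are unchanged, while the sampled router still visits lower-scoring strategies occasionally. A higher `temperature` flattens the probabilities and increases that variety.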

Load-bearing premise

The simulation environment used to train the router accurately captures the distribution of real student responses and engagement patterns that occur in live high-school tutoring sessions.

What would settle it

An A/B test that keeps the same interface and student pool but replaces the learned router with random strategy selection and finds no difference in conversion rate or turn count would falsify the claim that the trained routing drives the gains.

Figures

Figures reproduced from arXiv: 2606.20138 by Aske Plaat, Michiel T. van der Meer, Nicholas Hogan, Po-Chin Chang.

**Figure 2.** Score Distribution by Simulated-Student Profile. The clear separation between “Motivated”, “Mediocre”, and “Unmotivated” simulated-student profiles establishes a difficulty gradient for the model and confirms the LLM evaluator’s discriminative validity. To quantitatively validate whether the router discovered the optimal Subject-Prompt mapping, we conducted an empirical alignment analysis. We compared the…

**Figure 4.** Learning dynamics in simulation. (a) Stable score growth despite high environment noise. (b) Strategy…

**Figure 5.** Training score trajectory during the live deployment phase. The ascent after March 28 suggests the…

**Figure 6.** Detailed Architecture of the Actor-Critic Policy Model. The framework integrates frozen semantic features…

**Figure 7.** Cosine similarity heatmaps (a) the collapsed space of pre-trained topic embeddings vs. (b) the representa…

**Figure 8.** Empirical Score Calibration. Fixing K = 3.0 provides the optimal difficulty gradient and prevents over-optimization on unrealistic synthetic feedback…

**Figure 9.** Consistency of Feedback Features. The horizontal bars are the agreement rate for each criterion.

**Figure 10.** Overall pedagogical score distribution across 390 simulated scenarios. The dynamic prompt selection…

**Figure 11.** Score distribution stratified by simulated student profile. The routing model demonstrates superior…

**Figure 12.** Detailed criteria analysis. The router outperforms baselines in driving higher student correctness.

**Figure 13.** Overall distribution of conversational turns. The dataset exhibits a high concentration of very short…

**Figure 14.** Comparison of interaction length for substantial interactions (Turns…

**Figure 15.** Comparison of Exercise Conversion Rates. (a) The detailed breakdown reveals that the exploration group…

**Figure 16.** Distribution of student exercise accuracy categorized by AI score brackets. Although the difference…

**Figure 17.** Comparison of Exercise Accuracy. The exploitation group of router shows the highest average accuracy…

Original abstract

LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaffolding, student understanding) extracted from raw transcripts. We first train a prompt routing model in a simulation environment, and then deploy it for online adaptation with actual high-school students. The simulation benchmark shows the router outperforming two static baselines ($0.694$ vs. $0.647$ and $0.64$, $p<0.001$). A/B testing ($N=656$ conversations from 359 students) shows sim-to-real transfer where the model switches from analytical to scaffolding learning strategies. Our adaptive prompt selection mechanism improves instructional efficiency, maintains pedagogical quality and reduces interactions by around 3 turns ($p=0.007$). While a greedy router achieves a comparable exercise conversion rate with the baseline ($19.1\%$ vs. $19.6\%$), a stochastic router that samples strategies leads to a higher conversion rate ($28.1\%$).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a live A/B deployment of a feature-based prompt router for LLM tutoring with modest efficiency gains, but the simulation used to train it has no reported validation against real student data.

The main point is that they extract 14 pedagogical features from transcripts, train a router in simulation, and then run it in an A/B test with 359 high-school students. The adaptive system cuts interactions by roughly 3 turns (p=0.007) and the stochastic router lifts exercise conversion to 28.1% from the 19.6% baseline.

What stands out is the actual deployment data rather than another simulation-only result. They report concrete numbers from both the simulation benchmark and the live test, which is more than many prompting papers deliver.

The soft spot is the simulation step. Nothing in the abstract shows that the simulated student responses match the distribution of real high-school engagement or understanding, so the policy learned there could be tuned to artifacts that do not appear in the classroom. The fact that only the stochastic sampler improves conversion while the greedy router matches the baseline suggests the gain may come from added variety more than from the learned routing itself. Feature selection and baseline prompt matching are also not described.

This is for researchers working on deployed educational LLMs who want to see real-student numbers. A reader focused on prompt adaptation or tutoring systems can extract the empirical outcomes. It deserves peer review because the live experiment provides something to evaluate, even if the methods will need expansion to support the claims.

Referee Report

3 major / 1 minor

Summary. The paper claims that an LLM tutoring system using 14 pedagogical features extracted from transcripts, with a prompt router trained in simulation and deployed adaptively, outperforms static baselines in simulation (0.694 vs. 0.647/0.64, p<0.001) and yields higher exercise conversion (28.1% for stochastic router) plus ~3 fewer turns (p=0.007) in an A/B test of 656 conversations from 359 high-school students.

Significance. If the sim-to-real transfer holds, the work provides evidence that stochastic routing among pedagogical strategies can improve instructional efficiency in live LLM tutoring while preserving quality. The real-student A/B test with hundreds of participants is a methodological strength that supports potential applicability to high-school settings.

major comments (3)

[Simulation environment section] The router is trained exclusively in simulation before deployment, yet no quantitative sim-to-real diagnostics (feature distribution distances, response-model calibration error, or variance in student understanding) are reported; this assumption is load-bearing for the reported 28.1% conversion rate and p=0.007 result.
[A/B testing results paragraph] The manuscript provides no detail on validation of the 14 features, the router training procedure, or whether baseline prompts were matched for length and style, leaving the central claim that the adaptive system reduces interactions by ~3 turns dependent on unshown methods.
[Methods section on feature extraction] The claim that the 14 features (e.g., tutor scaffolding, student understanding) enable subject-aware prompting rests on their extraction from raw transcripts, but no evidence is given that these features were validated against real student engagement patterns.

minor comments (1)

[Abstract] The N=656 conversations figure is stated but the split between conditions is not given, which would aid interpretation of the conversion-rate comparison.
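Why the per-condition split matters can be illustrated with a standard pooled two-proportion z-test. The sketch below is an illustration under stated assumptions: the group sizes and conversion counts are hypothetical placeholders, not figures from the paper, and the source does not say which test the authors applied to the conversion rates.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: the same observed rates (28% vs. 20%) at two group sizes.
small = two_proportion_z_test(28, 100, 20, 100)
large = two_proportion_z_test(280, 1000, 200, 1000)
print("100 per arm:  z=%.2f, p=%.4f" % small)
print("1000 per arm: z=%.2f, p=%.4f" % large)
```

With identical observed rates, the larger groups yield a smaller p-value, so a reader cannot judge the strength of a conversion-rate difference without knowing how many conversations fell into each condition.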

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of methodological transparency and sim-to-real transfer. We respond to each major comment below.

Referee: [Simulation environment section] The router is trained exclusively in simulation before deployment, yet no quantitative sim-to-real diagnostics (feature distribution distances, response-model calibration error, or variance in student understanding) are reported; this assumption is load-bearing for the reported 28.1% conversion rate and p=0.007 result.

Authors: We agree that quantitative sim-to-real diagnostics strengthen the transfer claims. In the revised manuscript we will add feature distribution comparisons (e.g., Wasserstein distances) between simulation and real transcripts, plus any available response-model calibration metrics. This directly addresses the load-bearing assumption for the deployment results.

Revision: yes

Referee: [A/B testing results paragraph] The manuscript provides no detail on validation of the 14 features, the router training procedure, or whether baseline prompts were matched for length and style, leaving the central claim that the adaptive system reduces interactions by ~3 turns dependent on unshown methods.

Authors: We will expand the Methods and results sections to detail the 14-feature validation approach, the full router training procedure (including simulation hyperparameters and data generation), and explicit confirmation that baseline prompts were matched for length and stylistic tone. These additions will make the ~3-turn reduction claim fully supported by documented methods.

Revision: yes

Referee: [Methods section on feature extraction] The claim that the 14 features (e.g., tutor scaffolding, student understanding) enable subject-aware prompting rests on their extraction from raw transcripts, but no evidence is given that these features were validated against real student engagement patterns.

Authors: The features were selected from pedagogical literature and implemented via LLM classifiers on transcripts. Their effectiveness is evidenced by the A/B test outcomes (higher conversion, shorter dialogues). We will add a dedicated paragraph on feature selection rationale and any post-hoc engagement correlations in the revision; a separate pre-deployment validation study was not performed.

Revision: partial
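The feature distribution comparison proposed in the first response could look like the following. This is a minimal sketch under stated assumptions: it handles one feature at a time and two samples of equal size, the function name and the toy data are hypothetical, and the rebuttal does not specify how the authors would compute the distance.

```python
def wasserstein_1d(sim_values, real_values):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    if not sim_values or len(sim_values) != len(real_values):
        raise ValueError("samples must be non-empty and of equal size")
    # For equal-size samples, the optimal transport plan pairs sorted values.
    pairs = zip(sorted(sim_values), sorted(real_values))
    return sum(abs(s - r) for s, r in pairs) / len(sim_values)


# Hypothetical per-session values of one binary feature (1 = criterion met).
sim_scaffolding = [0, 0, 1, 1]
real_scaffolding = [1, 1, 1, 1]
print(wasserstein_1d(sim_scaffolding, real_scaffolding))
```

For a binary feature this distance reduces to the absolute difference between the simulated and real rates at which the criterion is met, so a value near zero for each of the 14 features would be one piece of evidence that the simulator resembles live sessions.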

Circularity Check

0 steps flagged

No circularity: empirical results from independent sim training and real A/B test

The paper reports training a router in a simulation environment followed by deployment and A/B testing on real students (N=359). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims (conversion rates, turn reduction) rest on separate empirical measurements rather than reducing to inputs by construction. This is a standard non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, model specifications, or data-processing details, preventing identification of free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5739 in / 1134 out tokens · 24753 ms · 2026-06-26T17:45:03.532557+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 3 canonical work pages · 2 internal anchors

[1]

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 272–292, Suzhou, China

From problem-solving to teaching problem-solving: Aligning LLMs with pedagogy using reinforcement learning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 272–292, Suzhou, China. Association for Computational Linguistics. Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2019. Representati...

[2]

OpenAI GPT-5 System Card

OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267. Pankaj Singh. 2026. Querywise prompt routing for large language models. International Journal of Research and Innovation in Social Science, 10(19):605–611. Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. 2024. A long way to go: Investigating length correlations in RLHF. Joar Skalse, Nik...

[3]

Large Language Models as Optimizers

Large language models as optimizers. volume abs/2309.03409. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Proceedings of the 37th International Conference on ...

[4]

Answer ≥ 2 questions: The user answers two or more questions posed by the assistant about the main topic. (False if fewer than two or off-topic)

[5]

Ask ≥ 2 on-topic questions: The user asks two or more questions (why, how, what, when) connected to the topic. (Exclude generic greetings)

[6]

Interact > 3 times: The user sends four or more substantive messages containing reasoning or topic-related inquiry.

4. Positive social exchanges: The user expresses positivity (e.g., “thank you”, emojis) at least twice

[7]

Answers mostly correct: The user’s responses align with explanations in most cases with few clear mistakes

[8]

Correct within 2 turns: The user provides a correct response within two attempts at least 75% of the time when prompted

[9]

Shows understanding: Relevant responses and follow-up questions demonstrate comprehension without repeated confusion

[10]

Shows curiosity: Asks at least one question that goes beyond basic requirements (explores “why” or “how”)

[11]

Justifies mistakes: After an error, the user either reflects on the reasoning or provides a corrected answer later.

10. Assistant on topic: The assistant remains focused on the learning goal throughout the interaction

[12]

Assistant scaffolding: The tutor offers progressive, multi-turn guidance and adjusts help level when the student struggles. (False if gives full answers immediately)

12. Assistant diagnoses: The assistant identifies specific mistakes and provides tailored clarifications

[13]

Assistant balances: The assistant alternates between explaining and prompting, avoiding a monologue-style delivery

[14]

Empirical Best Prompt

Assistant adapts: The tutor changes behavior (e.g., more explanation after mistakes) based on student performance.

I Pedagogical Criteria Weighting

To ensure the AI feedback signal aligns with human pedagogical judgment, we correlated the 14 LLM-extracted features against an expert-labeled dataset (N = 138). Human experts evaluated sessions and label it ...
