Recognition: no theorem link
Improving Hybrid Human-AI Tutoring by Differentiating Human Tutor Roles Based on Student Needs
Pith reviewed 2026-05-13 02:47 UTC · model grok-4.3
The pith
Differentiated proactive and reactive human tutor roles in hybrid human-AI systems raise time on task by 25%, skill proficiency by 36%, and academic growth by 61% over AI-only tutoring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a difference-in-discontinuity design, the authors show that assigning proactive human tutoring to students below the median state test score and reactive support to those at or above it yields substantial improvements over AI-only tutoring: a 25% increase in time on task, a 36% increase in skill proficiency, and a 61% increase in academic growth as measured by the MAP test. While the two approaches yield similar gains in time on task and proficiency, proactive tutoring shows marginally higher MAP growth (75% higher, p = .065), with larger benefits for students farther below the cutoff, thereby helping to narrow achievement gaps.
What carries the argument
Difference-in-discontinuity (DiDC) design that compares the discontinuity in outcomes around the within-grade median state test score cutoff between the AI-only fall period and the differentiated hybrid spring period.
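A minimal sketch of how such a DiDC effect can be estimated, assuming a long-format table with one row per student-period and hypothetical columns score (running variable centered at the within-grade median), post (fall = 0, spring = 1), and outcome; none of these names come from the paper's repository. A local linear model with side-of-cutoff and period interactions recovers the policy effect as the below:post coefficient:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical schema (not from the paper's code):
#   score   - state test score centered at the within-grade median cutoff
#   post    - 0 for fall (AI-only), 1 for spring (differentiated hybrid)
#   outcome - e.g., time on task, skill proficiency, or MAP growth
def didc_effect(df: pd.DataFrame, bandwidth: float) -> float:
    """Local linear difference-in-discontinuity at score = 0."""
    local = df.loc[df["score"].abs() <= bandwidth].copy()
    local["below"] = (local["score"] < 0).astype(int)  # proactive side
    # Fit separate slopes on each side of the cutoff in each period; the
    # below:post coefficient is the change in the cutoff discontinuity
    # from fall to spring, i.e., the DiDC estimate.
    fit = smf.ols("outcome ~ below * score * post", data=local).fit(cov_type="HC1")
    return fit.params["below:post"]
```

The fall terms absorb any pre-existing discontinuity at the median, so the estimate isolates what changed when the differentiated policy was introduced in spring.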
If this is right
- Hybrid human-AI tutoring with differentiated roles outperforms AI-only tutoring across time on task, skill proficiency, and standardized growth.
- Proactive and reactive human support produce comparable gains in engagement and proficiency for their respective groups.
- Proactive tutoring provides additional MAP growth benefits that help narrow achievement gaps for students below the median.
- The differentiated policy offers a cost-effective model for scaling hybrid tutoring by matching human effort to student need.
Where Pith is reading between the lines
- This cutoff-based assignment could be adapted using local or real-time performance data instead of annual state tests.
- Dynamic switching between proactive and reactive modes might further optimize outcomes if performance changes mid-semester (a minimal policy sketch follows this list).
- The approach raises the possibility of applying similar differentiation in other AI learning tools, such as adaptive homework platforms.
- Tracking whether the gap-narrowing effect persists across multiple years would test longer-term equity gains.
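To make the dynamic-switching idea above concrete, here is a hedged sketch of a threshold policy with a hysteresis band, so students near the median do not flip modes every week; the rolling signal, margin, and mode names are illustrative assumptions, not anything the paper implements:

```python
from dataclasses import dataclass

@dataclass
class DynamicTutorPolicy:
    median: float          # within-grade median of the rolling performance signal
    margin: float = 0.05   # hysteresis band to avoid rapid mode flapping

    def assign(self, rolling_score: float, current_mode: str) -> str:
        """Return 'proactive' or 'reactive' given the latest rolling score."""
        if rolling_score < self.median - self.margin:
            return "proactive"   # tutor-initiated support
        if rolling_score >= self.median + self.margin:
            return "reactive"    # on-demand support
        return current_mode      # inside the band: keep the current mode
```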
Load-bearing premise
The within-grade median state test score cutoff cleanly separates students who benefit differently from proactive versus reactive human support, and the DiDC comparison isolates the policy effect without unmeasured time-varying confounders.
What would settle it
Observing no larger discontinuity in spring outcomes around the median cutoff than existed in fall, or finding that the extra MAP growth for proactive tutoring disappears after controlling for seasonal or other factors, would indicate the differentiated policy does not deliver the claimed benefits.
Original abstract
Hybrid human-AI tutoring, where technology and humans jointly facilitate student learning, can be more beneficial than AI-only tutoring. However, preliminary evidence suggests that lower-performing students derive greater benefit from human-AI tutoring than higher-performing students. As such, this study evaluates whether a differentiated tutoring policy can effectively support both groups: human tutors initiate support for lower-performing students, while higher-performing students receive reactive, on-demand support. Using their within-grade median state test scores, we assigned 635 students (grades 5-8) to receive proactive (< median) or reactive (≥ median) tutoring. Using a DiDC design, we compare outcomes across two time periods: fall (AI-only tutoring) and spring (proactive-reactive human-AI tutoring). This quasi-experimental design isolates the effects of proactive-reactive tutoring approaches by comparing the discontinuity in spring outcomes to the fall, where no such discontinuity existed. Using data around the cutoff (Imbens-Kalyanaraman criterion), we find significant overall improvements from human-AI tutoring compared to AI-only baseline: 25% increase in time on task, 36% in skill proficiency, and 61% in academic growth (standardized MAP test). Between proactive and reactive tutoring, we find comparable improvements in time-on-task and skill proficiency. However, proactive tutoring, on average, showed marginally higher MAP growth (75%, p = .065) than reactive tutoring, i.e., proactive tutoring was more beneficial to students farther below the cutoff and helped narrow achievement gaps. Our findings provide evidence that differentiated human-AI tutoring addresses the needs of both groups, offering a practical and cost-effective strategy for scaling hybrid instruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates a differentiated hybrid human-AI tutoring policy for 635 students in grades 5-8. Using within-grade median state test scores, students below the median receive proactive human support while those at or above receive reactive support. A difference-in-discontinuity (DiDC) design compares fall (AI-only) to spring (hybrid differentiated) outcomes around the cutoff, with the analysis bandwidth selected via the Imbens-Kalyanaraman criterion. The study reports overall gains versus the AI-only baseline of 25% in time on task, 36% in skill proficiency, and 61% in standardized MAP growth. Proactive and reactive approaches yield comparable gains in time on task and proficiency, but proactive tutoring shows a marginally higher MAP growth advantage (75%, p = .065), interpreted as narrowing achievement gaps for lower-performing students.
Significance. If the DiDC identification holds, the work provides practical evidence that tailoring human tutor roles in hybrid systems (proactive for lower performers, reactive for higher performers) can scale hybrid instruction while improving equity and outcomes. The quasi-experimental approach with explicit bandwidth selection is a methodological strength that supports causal claims about policy differentiation. This contributes to the human-AI tutoring literature by demonstrating a cost-effective strategy that addresses heterogeneous student needs.
major comments (2)
- The DiDC design attributes spring outcome discontinuities at the median cutoff to the proactive/reactive policy, but this requires that fall-to-spring changes would have been continuous at the cutoff absent the intervention. The manuscript reports the marginal p=.065 result for the proactive MAP advantage without detailing robustness checks (e.g., covariate discontinuity tests, placebo cutoffs, or sensitivity to bandwidth choice) that would support the no time-varying confounder assumption. (DiDC analysis and results)
- The exact sample size of observations within the Imbens-Kalyanaraman bandwidth used for the local DiDC estimation is not reported. This information is load-bearing for assessing statistical power, precision of the 25%/36%/61% gains, and reliability of the p=.065 differential effect. (Results reporting)
minor comments (1)
- Clarify the precise interpretation of the '75%' MAP growth figure for proactive versus reactive tutoring (e.g., relative percentage increase, standardized effect size, or model coefficient). (Abstract and results)
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important areas for strengthening the causal claims and transparency of our DiDC analysis. We address each major comment below and will incorporate revisions to improve the manuscript.
Point-by-point responses
-
Referee: The DiDC design attributes spring outcome discontinuities at the median cutoff to the proactive/reactive policy, but this requires that fall-to-spring changes would have been continuous at the cutoff absent the intervention. The manuscript reports the marginal p=.065 result for the proactive MAP advantage without detailing robustness checks (e.g., covariate discontinuity tests, placebo cutoffs, or sensitivity to bandwidth choice) that would support the no time-varying confounder assumption. (DiDC analysis and results)
Authors: We agree that explicit robustness checks are needed to bolster confidence in the identifying assumption of no time-varying confounders at the cutoff. In the revised manuscript, we will add: (1) covariate discontinuity tests for baseline characteristics (e.g., prior achievement, demographics) to verify balance; (2) placebo tests using alternative cutoffs (such as other within-grade quantiles) where no policy change occurred; and (3) sensitivity analyses to bandwidth choice around the Imbens-Kalyanaraman selection. These will be reported in a new subsection of the results to support the DiDC interpretation, including the marginal proactive MAP growth advantage. revision: yes
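A minimal sketch of the proposed placebo test, reusing the hypothetical didc_effect() from the earlier sketch; the quantiles, bandwidth handling, and raw_score column are illustrative assumptions rather than the authors' implementation:

```python
import pandas as pd

# Re-center the running variable at cutoffs where no policy change
# occurred and re-estimate the DiDC effect. Estimates near zero at the
# placebo cutoffs support the no-time-varying-confounder assumption.
def placebo_didc(df: pd.DataFrame, bandwidth: float,
                 quantiles=(0.25, 0.40, 0.60, 0.75)) -> dict:
    effects = {}
    for q in quantiles:
        shifted = df.copy()
        shifted["score"] = df["raw_score"] - df["raw_score"].quantile(q)
        effects[q] = didc_effect(shifted, bandwidth)  # from earlier sketch
    return effects
```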
-
Referee: The exact sample size of observations within the Imbens-Kalyanaraman bandwidth used for the local DiDC estimation is not reported. This information is load-bearing for assessing statistical power, precision of the 25%/36%/61% gains, and reliability of the p=.065 differential effect. (Results reporting)
Authors: We concur that the precise number of observations within the selected bandwidth is critical for evaluating the estimates. We will report the exact sample sizes (overall and by outcome) used in the local DiDC estimation in the revised results section, along with the corresponding effective sample sizes after applying the Imbens-Kalyanaraman criterion. revision: yes
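A small sketch of the reporting the referee asks for (effective sample sizes within candidate bandwidths around the cutoff), using the same hypothetical schema as the earlier sketches; the bandwidth values are placeholders, not the paper's Imbens-Kalyanaraman selections:

```python
import pandas as pd

def effective_n(df: pd.DataFrame, bandwidths=(5.0, 10.0, 15.0)) -> pd.DataFrame:
    """Count observations within each bandwidth, split by side of the cutoff."""
    rows = []
    for h in bandwidths:
        local = df.loc[df["score"].abs() <= h]
        rows.append({
            "bandwidth": h,
            "n_total": len(local),
            "n_below": int((local["score"] < 0).sum()),        # proactive side
            "n_at_or_above": int((local["score"] >= 0).sum()), # reactive side
        })
    return pd.DataFrame(rows)
```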
Circularity Check
No circularity: purely empirical DiDC reporting with no derivation chain
full rationale
The paper presents measured outcomes from a quasi-experimental difference-in-discontinuity (DiDC) design comparing fall AI-only tutoring to spring differentiated hybrid tutoring, using within-grade median state test scores as the cutoff. No equations, fitted parameters, predictions, or self-citations are invoked to derive the central claims; the reported improvements (time on task, skill proficiency, MAP growth) are direct empirical results. The analysis is grounded in external data benchmarks and does not, by construction, reduce any result to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The DiDC design isolates the causal effect of the proactive-reactive policy by comparing the spring discontinuity to the fall baseline, where no policy discontinuity existed.
Reference graph
Works this paper leans on
-
[1]
Improving Hybrid Human-AI Tutoring by Differentiating Human Tutor Roles Based on Student Needs
INTRODUCTION The efficacy of human tutoring is well established in educational research [43], with meta-analyses showing it to be particularly beneficial for lower-performing and socioeconomically disadvantaged students [19, 44]. However, scaling human tutoring remains a persistent challenge due to cost and logistical constraints. Automated altern...
work page 2026
-
[2]
When calculating the volume of a cylinder, do we use the area or the perimeter of the base?
BACKGROUND This section reviews the current evidence on the effectiveness of human tutoring, AI tutors, and human-AI tutoring. We also examine how researchers and practitioners have sought to scale these interventions. 2.1 Human Tutoring Bloom’s seminal 2-sigma effect [9] highlighted the impressive learning benefits of one-on-one human tutoring (under ide...
-
[3]
PRESENT STUDY Building on evidence that lower-performing students benefit more from additional human tutor support (Section 2.3.2), we propose a proactive-reactive tutoring policy that matches the intensity of human support to student needs. Human tutors initiate additional motivational and conceptual support to lower-performing students and reactive, o...
-
[4]
METHOD We use a DiDC design to estimate the effects of introducing human-AI tutoring (RQ1; DiD) and the causal effects of proactive versus reactive tutoring (RQ2; estimated at the median cutoff). 4.1 Dataset The dataset includes assignment logs from IXL (AI tutor) and standardized test scores for 635 students in grades 5–8 at a middle school in a Mid-Atla...
work page 2024
-
[5]
RESULTS The code used for the analysis is available in the anonymous GitHub repository. 5.1 Descriptive Statistics The descriptive statistics for student performance are reported in Table 2. These include prior year’s state test scores and the winter and spring MAP results. Table 2: The mean and SD of student performance on the state and MAP tests. Gr...
-
[6]
The gray line represents the students’ relative MAP growth (1.3×) during the AI-only baseline period
DISCUSSION Given the heterogeneous treatment effects of human-AI tutoring, we study a differentiated proactive–reactive tutoring policy in which lower-performing students receive tutor-initiated support, whereas higher-performing students re... (footnote: http://tiny.cc/EDM26) Figure 6: Examining the impact of proactive and reactive tutoring on students’ relative...
-
[7]
LIMITATIONS Our quasi-experimental DiDC design [46] provides evidence for the benefits of personalizing tutoring support, though stronger causal evidence could be achieved from randomized controlled trials. Similarly, our reliance on national growth norms as a reference point, rather than a matched control group, limits causal inference. However, ethical...
-
[8]
FUTURE WORK The absence of Zoom interaction data (actual student–tutor interactions) represents a missed opportunity for deeper analysis, especially for understanding which elements of proactive tutoring were effective for student learning. We primarily relied on lead tutors and tutor supervisors to ensure implementation fidelity; better Zoom inte...
-
[9]
CONCLUSION This study explored a key challenge in scaling human-AI tutoring: how to allocate limited human tutor support to effectively meet diverse student learning needs. Motivated by preliminary evidence that lower-performing students derive greater benefits from human-AI tutoring than higher-performing peers, we evaluated a differentiated proactive...
-
[10]
ADDITIONAL AUTHORS Additional authors: Alex Houk (ahouk@andrew.cmu.edu), Erin Gatz (egatz@andrew.cmu.edu), and Boyuan (Bill) Guo (boyuang@cmu.edu)
-
[11]
V. Aleven, R. Baraniuk, E. Brunskill, S. Crossley, D. Demszky, S. Fancsali, S. Gupta, K. Koedinger, C. Piech, S. Ritter, et al. Towards the future of ai-augmented human tutoring in math learning. In International Conference on Artificial Intelligence in Education, pages 26–31. Springer, 2023
work page 2023
- [12]
- [13]
-
[14]
P. An, K. Holstein, B. d’Anjou, B. Eggen, and S. Bakker. The ta framework: Designing real-time teaching augmentation for k-12 classrooms. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–17, 2020
work page 2020
-
[15]
J. R. Anderson, A. T. Corbett, K. R. Koedinger, and R. Pelletier. Cognitive tutors: Lessons learned. The journal of the learning sciences, 4(2):167–207, 1995
work page 1995
-
[16]
R. S. Baker. Modeling and understanding students’ off-task behavior in intelligent tutoring systems. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 1059–1068, 2007
work page 2007
-
[17]
M. P. Bhatt, J. Guryan, S. A. Khan, M. LaForest-Tucker, and B. Mishra. Can technology facilitate scale? evidence from a randomized evaluation of high dosage tutoring. Technical report, National Bureau of Economic Research, 2024
work page 2024
-
[18]
J. Bleiberg, C. D. Robinson, E. Bennett, and S. Loeb. The impact of tutor gender match on girls’ stem interest, engagement, and performance. 2025
work page 2025
-
[19]
B. S. Bloom. The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational researcher, 13(6):4–16, 1984
work page 1984
-
[20]
C. Borchers, A. Gurung, Q. Liu, D. R. Thomas, M. Khalil, and K. R. Koedinger. Brief but impactful: How human tutoring interactions shape engagement in online learning. arXiv preprint arXiv:2601.09994, 2026
-
[21]
C. Borchers, A. Houk, V. Aleven, and K. R. Koedinger. Engagement and learning benefits of goal setting with rewards in human-ai tutoring. In International Conference on Artificial Intelligence in Education, pages 46–59. Springer, 2025
work page 2025
-
[22]
S. Calonico, M. D. Cattaneo, and R. Titiunik. Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6):2295–2326, 2014
work page 2014
-
[23]
M. T. Chi, S. A. Siler, H. Jeong, T. Yamauchi, and R. G. Hausmann. Learning from human tutoring. Cognitive science, 25(4):471–533, 2001
work page 2001
-
[24]
D. R. Chine, C. Brentley, C. Thomas-Browne, J. E. Richey, A. Gul, P. F. Carvalho, L. Branstetter, and K. R. Koedinger. Educational equity through combined human-ai personalization: A propensity matching evaluation. In International conference on artificial intelligence in education, pages 366–377. Springer, 2022
work page 2022
-
[25]
S. Copeland, M. A. Cook, A. A. Grant, and S. M. Ross. Randomized-control efficacy study of ixl math in holland public schools. Center for Research and Reform in Education, 2023
work page 2023
-
[26]
A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User modeling and user-adapted interaction, 4(4):253–278, 1994
work page 1994
-
[27]
M. G. Core, J. D. Moore, and C. Zinn. The role of initiative in tutorial dialogue. In 10th Conference of the European Chapter of the Association for Computational Linguistics, 2003
work page 2003
- [28]
-
[29]
J. Dietrichson, M. Bøg, T. Filges, and A.-M. Klint Jørgensen. Academic interventions for elementary and middle school students with low socioeconomic status: A systematic review and meta-analysis. Review of educational research, 87(2):243–282, 2017
work page 2017
- [30]
-
[31]
Y. Fang, Z. Ren, X. Hu, and A. C. Graesser. A meta-analysis of the effectiveness of aleks on learning. Educational Psychology, 39(10):1278–1292, 2019
work page 2019
-
[32]
K. Forbes-Riley and D. J. Litman. Analyzing dependencies between student certainness states and tutor responses in a spoken dialogue corpus. In Recent Trends in Discourse and Dialogue, pages 275–304. Springer, 2008
work page 2008
-
[33]
B. A. Fox. The Human Tutorial Dialogue Project: Issues in the design of instructional systems. CRC Press, 2020
work page 2020
-
[34]
A. Goodman-Bacon. Difference-in-differences with variation in treatment timing. Journal of econometrics, 225(2):254–277, 2021
work page 2021
-
[35]
A. A. Grant, M. A. Cook, and S. M. Ross. The impacts of i-ready personalized instruction on student math achievement. Center for Research and Reform in Education, 2023
work page 2023
-
[36]
A. Gurung, J. Lin, J. Gutterman, D. R. Thomas, A. Houk, S. Gupta, E. Brunskill, L. Branstetter, V. Aleven, and K. Koedinger. Human tutoring improves the impact of ai tutor use on learning outcomes. In International Conference on Artificial Intelligence in Education, pages 393–407. Springer, 2025
work page 2025
- [37]
- [38]
- [39]
-
[40]
N. T. Heffernan and C. L. Heffernan. The assistments ecosystem: Building a platform that brings scientists and teachers together for minimally invasive research on human learning and teaching. International Journal of Artificial Intelligence in Education, 24(4):470–497, 2014
work page 2014
-
[41]
K. Holstein, B. M. McLaren, and V. Aleven. Student learning benefits of a mixed-reality teacher awareness tool in ai-enhanced classrooms. In International conference on artificial intelligence in education, pages 154–168. Springer, 2018
work page 2018
-
[42]
G. Imbens and K. Kalyanaraman. Optimal bandwidth choice for the regression discontinuity estimator. The Review of economic studies, 79(3):933–959, 2012
work page 2012
-
[43]
G. W. Imbens and T. Lemieux. Regression discontinuity designs: A guide to practice. Journal of econometrics, 142(2):615–635, 2008
work page 2008
-
[44]
Measuring the impact of ixl math and ixl language arts in pennsylvania schools
IXL Learning. Measuring the impact of ixl math and ixl language arts in pennsylvania schools. Technical report, IXL Learning, 2020. Retrieved July 6, 2025
work page 2020
-
[45]
S. Katz, P. Albacete, I.-A. Chounta, P. Jordan, B. M. McLaren, and D. Zapata-Rivera. Linking dialogue with student modelling to create an adaptive tutoring system for conceptual physics. International journal of artificial intelligence in education, 31(3):397–445, 2021
work page 2021
-
[46]
J. A. Kulik and J. D. Fletcher. Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of educational research, 86(1):42–78, 2016
work page 2016
- [47]
-
[48]
J. Lin, Z. Han, D. R. Thomas, A. Gurung, S. Gupta, V. Aleven, and K. R. Koedinger. How can i get it right? using gpt to rephrase incorrect trainee responses. International journal of artificial intelligence in education, 35(2):482–508, 2025
work page 2025
-
[49]
J. McCrary. Manipulation of the running variable in the regression discontinuity design: A density test. Journal of econometrics, 142(2):698–714, 2008
work page 2008
-
[50]
D. S. McNamara, I. B. Levinstein, and C. Boonthum. istart: Interactive strategy training for active reading and thinking. Behavior Research Methods, Instruments, & Computers, 36(2):222–233, 2004
work page 2004
- [51]
-
[52]
J. Morrow and M. Ackermann. Intention to persist and retention of first-year students: The importance of motivation and sense of belonging. College student journal, 46(3):483–491, 2012
work page 2012
-
[53]
K. Muldner, R. Lam, and M. T. Chi. Comparing learning from observing and from human tutoring. Journal of Educational Psychology, 106(1):69, 2014
work page 2014
- [54]
- [55]
-
[56]
P. Picchetti, C. C. Pinto, and S. T. Shinoki. Difference-in-discontinuities: estimation, inference and validity tests. arXiv preprint arXiv:2405.18531, 2024
-
[57]
T. W. Price, Y. Dong, and D. Lipovac. isnap: towards intelligent tutoring in novice programming environments. In Proceedings of the 2017 ACM SIGCSE Technical Symposium on computer science education, pages 483–488, 2017
work page 2017
-
[58]
D. D. Ready, S. G. McCormick, and R. J. Shmoys. The effects of in-school virtual tutoring on student reading development: Evidence from a short-cycle randomized controlled trial. Journal of Education for Students Placed at Risk (JESPAR), pages 1–21, 2026
work page 2026
-
[59]
C. D. Robinson, C. Pollard, S. Novicoff, S. White, and S. Loeb. The effects of virtual tutoring on young readers: Results from a randomized controlled trial. Educational Evaluation and Policy Analysis, 47(4):1245–1265, 2025
work page 2025
-
[60]
R. D. Roscoe and D. S. McNamara. Writing pal: Feasibility of an intelligent writing strategy tutor in the high school classroom. Journal of Educational Psychology, 105(4):1010, 2013
work page 2013
- [61]
- [62]
-
[63]
V. J. Shute, G. Smith, R. Kuba, C.-P. Dai, S. Rahimi, Z. Liu, and R. Almond. The design, development, and testing of learning supports for the physics playground game. International Journal of Artificial Intelligence in Education, 31(3):357–379, 2021
work page 2021
-
[64]
S. Steenbergen-Hu and H. Cooper. A meta-analysis of the effectiveness of intelligent tutoring systems on k–12 students’ mathematical learning. Journal of educational psychology, 105(4):970, 2013
work page 2013
-
[65]
D. R. Thomas, J. Lin, E. Gatz, A. Gurung, S. Gupta, K. Norberg, S. E. Fancsali, V. Aleven, L. Branstetter, E. Brunskill, et al. Improving student learning with hybrid human-ai tutoring: A three-study quasi-experimental investigation. In Proceedings of the 14th Learning Analytics and Knowledge Conference, pages 404–415, 2024
work page 2024
-
[66]
K. VanLehn. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational psychologist, 46(4):197–221, 2011
work page 2011
-
[67]
M. Waalkens, V. Aleven, and N. Taatgen. Does supporting multiple student strategies lead to greater learning and motivation? investigating a source of complexity in the architecture of intelligent tutoring systems. Computers & Education, 60(1):159–171, 2013
work page 2013
-
[68]
K. B. Yang, V. Echeverria, Z. Lu, H. Mao, K. Holstein, N. Rummel, and V. Aleven. Pair-up: prototyping human-ai co-orchestration of dynamic transitions between individual and collaborative learning in the classroom. In Proceedings of the 2023 CHI conference on human factors in computing systems, pages 1–17, 2023
work page 2023
-
[69]
K. B. Yang, V. Echeverria, X. Wang, L. Lawrence, K. Holstein, N. Rummel, and V. Aleven. Exploring policies for dynamically teaming up students through log data simulation. International Educational Data Mining Society, 2021
work page 2021
discussion (0)