pith. machine review for the scientific record.

arxiv: 2605.11155 · v1 · submitted 2026-05-11 · 💻 cs.CY

Recognition: no theorem link

Improving Hybrid Human-AI Tutoring by Differentiating Human Tutor Roles Based on Student Needs

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 02:47 UTC · model grok-4.3

classification 💻 cs.CY
keywords hybrid human-AI tutoring · differentiated tutoring · proactive tutoring · reactive tutoring · difference-in-discontinuity design · student achievement · achievement gaps · educational technology

The pith

Differentiated proactive and reactive human tutor roles in hybrid human-AI systems raise time on task by 25%, skill proficiency by 36%, and academic growth by 61% over AI-only tutoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether tailoring human involvement (proactive initiation for lower-performing students, reactive on-demand support for higher performers) makes hybrid human-AI tutoring more effective than AI alone. It assigns 635 students in grades 5-8 to one role or the other, with the within-grade median state test score as the cutoff. A difference-in-discontinuity design compares fall AI-only outcomes to spring differentiated hybrid outcomes around that cutoff. Results show clear overall gains from the hybrid policy, with proactive tutoring delivering a marginal extra boost in standardized growth that helps students below the cutoff catch up. Matching support to need in this way offers a practical route to scaling hybrid tutoring without uniform increases in human tutor time.
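
A minimal sketch of that assignment rule, assuming a pandas DataFrame with hypothetical grade and state_score columns (the paper's actual data schema is not reproduced on this page):

    import numpy as np
    import pandas as pd

    def assign_tutor_roles(df: pd.DataFrame) -> pd.DataFrame:
        # Median prior state test score within each student's grade.
        grade_median = df.groupby("grade")["state_score"].transform("median")
        return df.assign(
            # Below the within-grade median -> proactive; at or above -> reactive.
            role=np.where(df["state_score"] < grade_median, "proactive", "reactive"),
            # Centered running variable, reused by the estimation sketches below.
            score=df["state_score"] - grade_median,
        )

The centered score column doubles as the running variable for the discontinuity analysis sketched under "What carries the argument."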

Core claim

Using a difference-in-discontinuity design, the authors find that assigning proactive human tutoring to students below the median state test score and reactive support to those at or above it yields substantial improvements over AI-only tutoring: a 25% increase in time on task, 36% in skill proficiency, and 61% in academic growth as measured by the MAP test. While both approaches improve time and proficiency similarly, proactive tutoring shows a marginal advantage in MAP growth (75% higher, p=0.065), especially for students farther below the cutoff, thereby helping to narrow achievement gaps.
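
The paper's own notation is not rendered on this page, but the standard difference-in-discontinuity estimand behind these numbers, with running score S, within-grade median cutoff c, and outcome Y_t in period t ∈ {fall, spring}, is

    $$
    \hat{\tau}_{\mathrm{DiDC}}
      = \Bigl[\lim_{s \uparrow c} \mathbb{E}[Y_{\mathrm{spring}} \mid S = s]
            - \lim_{s \downarrow c} \mathbb{E}[Y_{\mathrm{spring}} \mid S = s]\Bigr]
      - \Bigl[\lim_{s \uparrow c} \mathbb{E}[Y_{\mathrm{fall}} \mid S = s]
            - \lim_{s \downarrow c} \mathbb{E}[Y_{\mathrm{fall}} \mid S = s]\Bigr]
    $$

The second bracket nets out any jump at the cutoff that already existed under AI-only tutoring in fall; what survives the subtraction is the differential effect of proactive over reactive support for students at the margin.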

What carries the argument

Difference-in-discontinuity (DiDC) design that compares the discontinuity in outcomes around the within-grade median state test score cutoff between the AI-only fall period and the differentiated hybrid spring period.
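
A minimal sketch of that comparison as a local-linear regression, assuming long-format data with one row per student and period and hypothetical outcome, spring (period indicator), and score columns (score centered at the cutoff, as in the assignment sketch above). The fixed bandwidth is a placeholder where the paper uses the Imbens-Kalyanaraman optimum:

    import pandas as pd
    import statsmodels.formula.api as smf

    def didc_estimate(df: pd.DataFrame, bandwidth: float = 50.0):
        """Local-linear difference-in-discontinuity. The coefficient on
        below:spring is the extra discontinuity that appears in spring,
        i.e. the differential effect of the proactive policy at the cutoff."""
        local = df[df["score"].abs() <= bandwidth].copy()
        local["below"] = (local["score"] < 0).astype(int)  # proactive side
        # Separate intercepts and slopes on each side of the cutoff, each period.
        fit = smf.ols("outcome ~ below * spring * score", data=local).fit(
            cov_type="HC1"  # heteroskedasticity-robust standard errors
        )
        return fit.params["below:spring"], fit

This is a sketch of the technique, not the authors' code; their analysis scripts are in the repository referenced in the paper's results section.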

If this is right

  • Hybrid human-AI tutoring with differentiated roles outperforms AI-only tutoring across time on task, skill proficiency, and standardized growth.
  • Proactive and reactive human support produce comparable gains in engagement and proficiency for their respective groups.
  • Proactive tutoring provides additional MAP growth benefits that help narrow achievement gaps for students below the median.
  • The differentiated policy offers a cost-effective model for scaling hybrid tutoring by matching human effort to student need.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This cutoff-based assignment could be adapted using local or real-time performance data instead of annual state tests.
  • Dynamic switching between proactive and reactive modes might further optimize outcomes if performance changes mid-semester.
  • The approach raises the possibility of applying similar differentiation in other AI learning tools, such as adaptive homework platforms.
  • Tracking whether the gap-narrowing effect persists across multiple years would test longer-term equity gains.

Load-bearing premise

The within-grade median state test score cutoff cleanly separates students who benefit differently from proactive versus reactive human support, and the DiDC comparison isolates the policy effect without unmeasured time-varying confounders.
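
One direct probe of this premise, sketched under the same hypothetical schema as above (the covariate names prior_score and attendance are illustrative, not from the paper): baseline characteristics measured before either period should show no discontinuity at the cutoff.

    import pandas as pd
    import statsmodels.formula.api as smf

    def covariate_balance(df: pd.DataFrame,
                          covariates=("prior_score", "attendance"),
                          bandwidth: float = 50.0):
        """RD regressions with baseline covariates as the outcome; a small
        p-value on `below` would flag a confounded cutoff."""
        local = df[df["score"].abs() <= bandwidth].copy()
        local["below"] = (local["score"] < 0).astype(int)
        return {
            cov: smf.ols(f"{cov} ~ below * score", data=local)
                 .fit(cov_type="HC1").pvalues["below"]
            for cov in covariates
        }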

What would settle it

Observing no larger discontinuity in spring outcomes around the median cutoff than existed in fall, or finding that the extra MAP growth for proactive tutoring disappears after controlling for seasonal or other factors, would indicate the differentiated policy does not deliver the claimed benefits.
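
Both falsifiers can be checked mechanically. A sketch reusing didc_estimate from the earlier sketch, with illustrative placebo quantiles and hypothetical raw_score and grade columns:

    import pandas as pd

    def placebo_cutoffs(df: pd.DataFrame, quantiles=(0.25, 0.4, 0.6, 0.75)):
        """Re-center the running variable at within-grade quantiles where no
        policy switch occurred; a genuine policy effect should surface only
        at the true median cutoff."""
        results = {}
        for q in quantiles:
            shifted = df.copy()
            shifted["score"] = df["raw_score"] - df.groupby("grade")[
                "raw_score"].transform(lambda s: s.quantile(q))
            est, fit = didc_estimate(shifted)  # from the sketch above
            results[q] = (est, fit.pvalues["below:spring"])
        return results

    def bandwidth_sweep(df: pd.DataFrame, bandwidths=range(20, 101, 10)):
        """The headline estimate should be stable as the window varies
        around the Imbens-Kalyanaraman bandwidth."""
        return {h: didc_estimate(df, bandwidth=float(h))[0] for h in bandwidths}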

Figures

Figures reproduced from arXiv: 2605.11155 by Ashish Gurung, Danielle R. Thomas, Emma Brunskill, Ge Gao, Jordan Gutterman, Kenneth R. Koedinger, Lee Branstetter, Shivang Gupta, Vincent Aleven.

Figure 1. Simulated scenarios for RD (top panel) and DiDC (bottom panel). The top panel shows heterogeneous treatment ...
Figure 2. Experimental timeline for evaluating a proactive-reactive tutoring policy. Fall IXL data and winter MAP scores serve ...
Figure 3. Within-subject improvement of relative academic ...
Figure 5. Within-subject improvement of skill proficiency in ...
Figure 6. Examining the impact of proactive and reactive ...
Original abstract

Hybrid human-AI tutoring, where technology and humans jointly facilitate student learning, can be more beneficial than AI-only tutoring. However, preliminary evidence suggests that lower-performing students derive greater benefit from human-AI tutoring than higher-performing students. As such, this study evaluates whether a differentiated tutoring policy can effectively support both groups: human tutors initiate support for lower-performing students, while higher-performing students receive reactive, on-demand support. Using their within-grade median state test scores, we assigned 635 students (grades 5-8) to receive proactive (< median) or reactive (≥ median) tutoring. Using a DiDC design, we compare outcomes across two time periods: fall (AI-only tutoring) and spring (proactive-reactive human-AI tutoring). This quasi-experimental design isolates the effects of proactive-reactive tutoring approaches by comparing the discontinuity in spring outcomes to the fall, where no such discontinuity existed. Using data around the cutoff (Imbens-Kalyanaraman criterion), we find significant overall improvements from human-AI tutoring compared to AI-only baseline: 25% increase in time on task, 36% in skill proficiency, and 61% in academic growth (standardized MAP test). Between proactive and reactive tutoring, we find comparable improvements in time-on-task and skill proficiency. However, proactive tutoring, on average, showed marginally higher MAP growth (75%, p = .065) than reactive tutoring, i.e., proactive tutoring was more beneficial to students farther below the cutoff and helped narrow achievement gaps. Our findings provide evidence that differentiated human-AI tutoring addresses the needs of both groups, offering a practical and cost-effective strategy for scaling hybrid instruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates a differentiated hybrid human-AI tutoring policy for 635 students in grades 5-8. Using within-grade median state test scores, students below the median receive proactive human support while those at or above receive reactive support. A difference-in-discontinuity (DiDC) design compares fall (AI-only) to spring (hybrid differentiated) outcomes around the cutoff selected via the Imbens-Kalyanaraman bandwidth. The study reports overall gains versus AI-only baseline of 25% in time on task, 36% in skill proficiency, and 61% in standardized MAP growth. Proactive and reactive approaches yield comparable gains in time-on-task and proficiency, but proactive tutoring shows a marginally higher MAP growth advantage (75%, p=.065), interpreted as narrowing achievement gaps for lower-performing students.

Significance. If the DiDC identification holds, the work provides practical evidence that tailoring human tutor roles in hybrid systems—proactive for lower performers and reactive for higher—can scale hybrid instruction while improving equity and outcomes. The quasi-experimental approach with explicit bandwidth selection is a methodological strength that supports causal claims about policy differentiation. This contributes to human-AI tutoring literature by demonstrating a cost-effective strategy that addresses heterogeneous student needs.

major comments (2)
  1. The DiDC design attributes spring outcome discontinuities at the median cutoff to the proactive/reactive policy, but this requires that fall-to-spring changes would have been continuous at the cutoff absent the intervention. The manuscript reports the marginal p=.065 result for the proactive MAP advantage without detailing robustness checks (e.g., covariate discontinuity tests, placebo cutoffs, or sensitivity to bandwidth choice) that would support the no time-varying confounder assumption. (DiDC analysis and results)
  2. The exact sample size of observations within the Imbens-Kalyanaraman bandwidth used for the local DiDC estimation is not reported. This information is load-bearing for assessing statistical power, precision of the 25%/36%/61% gains, and reliability of the p=.065 differential effect. (Results reporting)
minor comments (1)
  1. Clarify the precise interpretation of the '75%' MAP growth figure for proactive versus reactive tutoring (e.g., relative percentage increase, standardized effect size, or model coefficient). (Abstract and results)

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for strengthening the causal claims and transparency of our DiDC analysis. We address each major comment below and will incorporate revisions to improve the manuscript.

Point-by-point responses
  1. Referee: The DiDC design attributes spring outcome discontinuities at the median cutoff to the proactive/reactive policy, but this requires that fall-to-spring changes would have been continuous at the cutoff absent the intervention. The manuscript reports the marginal p=.065 result for the proactive MAP advantage without detailing robustness checks (e.g., covariate discontinuity tests, placebo cutoffs, or sensitivity to bandwidth choice) that would support the no time-varying confounder assumption. (DiDC analysis and results)

    Authors: We agree that explicit robustness checks are needed to bolster confidence in the identifying assumption of no time-varying confounders at the cutoff. In the revised manuscript, we will add: (1) covariate discontinuity tests for baseline characteristics (e.g., prior achievement, demographics) to verify balance; (2) placebo tests using alternative cutoffs (such as other within-grade quantiles) where no policy change occurred; and (3) sensitivity analyses to bandwidth choice around the Imbens-Kalyanaraman selection. These will be reported in a new subsection of the results to support the DiDC interpretation, including the marginal proactive MAP growth advantage. revision: yes

  2. Referee: The exact sample size of observations within the Imbens-Kalyanaraman bandwidth used for the local DiDC estimation is not reported. This information is load-bearing for assessing statistical power, precision of the 25%/36%/61% gains, and reliability of the p=.065 differential effect. (Results reporting)

    Authors: We concur that the precise number of observations within the selected bandwidth is critical for evaluating the estimates. We will report the exact sample sizes (overall and by outcome) used in the local DiDC estimation in the revised results section, along with the corresponding effective sample sizes after applying the Imbens-Kalyanaraman criterion. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical DiDC reporting with no derivation chain

full rationale

The paper presents measured outcomes from a quasi-experimental difference-in-discontinuity (DiDC) design comparing fall AI-only tutoring to spring differentiated hybrid tutoring, using within-grade median state test scores as the cutoff. No equations, fitted parameters, predictions, or self-citations are invoked to derive the central claims; the reported improvements (time on task, skill proficiency, MAP growth) are direct empirical results. The analysis is anchored to external data benchmarks rather than its own outputs and, by construction, does not reduce any result to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the validity of the quasi-experimental design and the assumption that the median cutoff captures differential needs; no free parameters are fitted in the reported results, and no new entities are postulated.

axioms (1)
  • domain assumption The DiDC design isolates the causal effect of the proactive-reactive policy by comparing the spring discontinuity to the fall baseline where no policy discontinuity existed.
    Invoked in the description of the quasi-experimental comparison between time periods.

pith-pipeline@v0.9.0 · 5640 in / 1319 out tokens · 78830 ms · 2026-05-13T02:47:57.266522+00:00 · methodology


Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 1 internal anchor

  1. [1] Improving Hybrid Human-AI Tutoring by Differentiating Human Tutor Roles Based on Student Needs (internal anchor, this paper) · INTRODUCTION: The efficacy of human tutoring is well established in educational research [43], with meta-analyses showing it to be particularly beneficial for lower-performing and socioeconomically disadvantaged students [19, 44]. However, scaling human tutoring remains a persistent challenge due to cost and logistical constraints. Automated altern...

  2. [2] "When calculating the volume of a cylinder, do we use the area or the perimeter of the base?" · BACKGROUND: This section reviews the current evidence on the effectiveness of human tutoring, AI tutors, and human-AI tutoring. We also examine how researchers and practitioners have sought to scale these interventions. 2.1 Human Tutoring: Bloom's seminal 2-sigma effect [9] highlighted the impressive learning benefits of one-on-one human tutoring (under ide...

  3. [3] "proactive" · PRESENT STUDY: Building on evidence that lower-performing students benefit more from additional human tutor support (Section 2.3.2), we propose a proactive-reactive tutoring policy that matches the intensity of human support to student needs. Human tutors initiate additional motivational and conceptual support to lower-performing students and reactive, o...

  4. [4] "strong evidence" · METHOD: We use a DiDC design to estimate the effects of introducing human-AI tutoring (RQ1; DiD) and the causal effects of proactive versus reactive tutoring (RQ2; estimated at the median cutoff). 4.1 Dataset: The dataset includes assignment logs from IXL (AI tutor) and standardized test scores for 635 students in grades 5-8 at a middle school in a Mid-Atla...

  5. [5] RESULTS: The code used for the analysis is available in the anonymous GitHub repository. 5.1 Descriptive Statistics: The descriptive statistics for student performance are reported in Table 2. These include prior year's state test scores and the winter and spring MAP results. Table 2: The mean and SD of student performance on the state and MAP tests. Gr...

  6. [6] "The gray line represents the students' relative MAP growth (1.3×) during the AI-only baseline period" · DISCUSSION: Given the heterogeneous treatment effects of human-AI tutoring, we study a differentiated proactive-reactive tutoring policy in which lower-performing students receive tutor-initiated support, whereas higher-performing students re... (footnote: http://tiny.cc/EDM26 · Figure 6: Examining the impact of proactive and reactive tutoring on students' relative...)

  7. [7] "Similarly, our reliance on national growth norms as a reference point, rather than a matched control group, limits causal inference" · LIMITATIONS: Our quasi-experimental DiDC design [46] provides evidence for the benefits of personalizing tutoring support, though stronger causal evidence could be achieved from randomized controlled trials. Similarly, our reliance on national growth norms as a reference point, rather than a matched control group, limits causal inference. However, ethical...

  8. [8] FUTURE WORK: The absence of Zoom interaction data (actual student-tutor interactions) represents a missed opportunity for deeper analysis, especially for understanding which elements of proactive tutoring were effective for student learning. We primarily relied on lead tutors and tutor supervisors to ensure implementation fidelity; better Zoom inte...

  9. [9] CONCLUSION: This study explored a key challenge in scaling human-AI tutoring: how to allocate limited human tutor support to effectively meet diverse student learning needs. Motivated by preliminary evidence that lower-performing students derive greater benefits from human-AI tutoring than higher-performing peers, we evaluated a differentiated proactive...

  10. [10] ADDITIONAL AUTHORS: Alex Houk (ahouk@andrew.cmu.edu), Erin Gatz (egatz@andrew.cmu.edu), and Boyuan (Bill) Guo (boyuang@cmu.edu).

  11. [11] V. Aleven, R. Baraniuk, E. Brunskill, S. Crossley, D. Demszky, S. Fancsali, S. Gupta, K. Koedinger, C. Piech, S. Ritter, et al. Towards the future of AI-augmented human tutoring in math learning. In International Conference on Artificial Intelligence in Education, pages 26–31. Springer, 2023.

  12. [12] V. Aleven, B. McLaren, I. Roll, and K. Koedinger. Toward meta-cognitive tutoring: A model of help seeking with a cognitive tutor. International Journal of Artificial Intelligence in Education, 16(2):101–128, 2006.

  13. [13] V. Aleven, E. A. McLaughlin, R. A. Glenn, and K. R. Koedinger. Instruction based on adaptive learning technologies. Handbook of Research on Learning and Instruction, 2:522–560, 2016.

  14. [14] P. An, K. Holstein, B. d'Anjou, B. Eggen, and S. Bakker. The TA framework: Designing real-time teaching augmentation for K-12 classrooms. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–17, 2020.

  15. [15] J. R. Anderson, A. T. Corbett, K. R. Koedinger, and R. Pelletier. Cognitive tutors: Lessons learned. The Journal of the Learning Sciences, 4(2):167–207, 1995.

  16. [16] R. S. Baker. Modeling and understanding students' off-task behavior in intelligent tutoring systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1059–1068, 2007.

  17. [17] M. P. Bhatt, J. Guryan, S. A. Khan, M. LaForest-Tucker, and B. Mishra. Can technology facilitate scale? Evidence from a randomized evaluation of high dosage tutoring. Technical report, National Bureau of Economic Research, 2024.

  18. [18] J. Bleiberg, C. D. Robinson, E. Bennett, and S. Loeb. The impact of tutor gender match on girls' STEM interest, engagement, and performance. 2025.

  19. [19] B. S. Bloom. The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, 13(6):4–16, 1984.

  20. [20] C. Borchers, A. Gurung, Q. Liu, D. R. Thomas, M. Khalil, and K. R. Koedinger. Brief but impactful: How human tutoring interactions shape engagement in online learning. arXiv preprint arXiv:2601.09994, 2026.

  21. [21] C. Borchers, A. Houk, V. Aleven, and K. R. Koedinger. Engagement and learning benefits of goal setting with rewards in human-AI tutoring. In International Conference on Artificial Intelligence in Education, pages 46–59. Springer, 2025.

  22. [22] S. Calonico, M. D. Cattaneo, and R. Titiunik. Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6):2295–2326, 2014.

  23. [23] M. T. Chi, S. A. Siler, H. Jeong, T. Yamauchi, and R. G. Hausmann. Learning from human tutoring. Cognitive Science, 25(4):471–533, 2001.

  24. [24] D. R. Chine, C. Brentley, C. Thomas-Browne, J. E. Richey, A. Gul, P. F. Carvalho, L. Branstetter, and K. R. Koedinger. Educational equity through combined human-AI personalization: A propensity matching evaluation. In International Conference on Artificial Intelligence in Education, pages 366–377. Springer, 2022.

  25. [25] S. Copeland, M. A. Cook, A. A. Grant, and S. M. Ross. Randomized-control efficacy study of IXL Math in Holland Public Schools. Center for Research and Reform in Education, 2023.

  26. [26] A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4):253–278, 1994.

  27. [27] M. G. Core, J. D. Moore, and C. Zinn. The role of initiative in tutorial dialogue. In 10th Conference of the European Chapter of the Association for Computational Linguistics, 2003.

  28. [28] E. Cosyn, H. Uzun, C. Doble, and J. Matayoshi. A practical perspective on knowledge space theory: ALEKS and its data. Journal of Mathematical Psychology, 101:102512, 2021.

  29. [29] J. Dietrichson, M. Bøg, T. Filges, and A.-M. Klint Jørgensen. Academic interventions for elementary and middle school students with low socioeconomic status: A systematic review and meta-analysis. Review of Educational Research, 87(2):243–282, 2017.

  30. [30] B. du Boulay. Recent meta-reviews and meta-analyses of AIED systems. International Journal of Artificial Intelligence in Education, 26(1):536–537, 2016.

  31. [31] Y. Fang, Z. Ren, X. Hu, and A. C. Graesser. A meta-analysis of the effectiveness of ALEKS on learning. Educational Psychology, 39(10):1278–1292, 2019.

  32. [32] K. Forbes-Riley and D. J. Litman. Analyzing dependencies between student certainness states and tutor responses in a spoken dialogue corpus. In Recent Trends in Discourse and Dialogue, pages 275–304. Springer, 2008.

  33. [33] B. A. Fox. The Human Tutorial Dialogue Project: Issues in the Design of Instructional Systems. CRC Press, 2020.

  34. [34] A. Goodman-Bacon. Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2):254–277, 2021.

  35. [35] A. A. Grant, M. A. Cook, and S. M. Ross. The impacts of i-Ready personalized instruction on student math achievement. Center for Research and Reform in Education, 2023.

  36. [36] A. Gurung, J. Lin, J. Gutterman, D. R. Thomas, A. Houk, S. Gupta, E. Brunskill, L. Branstetter, V. Aleven, and K. Koedinger. Human tutoring improves the impact of AI tutor use on learning outcomes. In International Conference on Artificial Intelligence in Education, pages 393–407. Springer, 2025.

  37. [37] A. Gurung, J. Lin, Z. Huang, C. Borchers, R. S. Baker, V. Aleven, and K. R. Koedinger. Starting seatwork earlier as a valid measure of student engagement. arXiv preprint arXiv:2505.13341, 2025.

  38. [38] J. Guryan, J. Ludwig, M. P. Bhatt, P. J. Cook, J. M. Davis, K. Dodge, G. Farkas, R. G. Fryer Jr, S. Mayer, H. Pollack, et al. Not too late: Improving academic outcomes among adolescents. American Economic Review, 113(3):738–765, 2023.

  39. [39] W. He and J. Meyer. MAP Growth universal screening benchmarks: Establishing MAP Growth as an effective universal screener. Northwest Evaluation Association, 2021.

  40. [40] N. T. Heffernan and C. L. Heffernan. The ASSISTments ecosystem: Building a platform that brings scientists and teachers together for minimally invasive research on human learning and teaching. International Journal of Artificial Intelligence in Education, 24(4):470–497, 2014.

  41. [41] K. Holstein, B. M. McLaren, and V. Aleven. Student learning benefits of a mixed-reality teacher awareness tool in AI-enhanced classrooms. In International Conference on Artificial Intelligence in Education, pages 154–168. Springer, 2018.

  42. [42] G. Imbens and K. Kalyanaraman. Optimal bandwidth choice for the regression discontinuity estimator. The Review of Economic Studies, 79(3):933–959, 2012.

  43. [43] G. W. Imbens and T. Lemieux. Regression discontinuity designs: A guide to practice. Journal of Econometrics, 142(2):615–635, 2008.

  44. [44] IXL Learning. Measuring the impact of IXL Math and IXL Language Arts in Pennsylvania schools. Technical report, IXL Learning, 2020. Retrieved July 6, 2025.

  45. [45] S. Katz, P. Albacete, I.-A. Chounta, P. Jordan, B. M. McLaren, and D. Zapata-Rivera. Linking dialogue with student modelling to create an adaptive tutoring system for conceptual physics. International Journal of Artificial Intelligence in Education, 31(3):397–445, 2021.

  46. [46] J. A. Kulik and J. D. Fletcher. Effectiveness of intelligent tutoring systems: A meta-analytic review. Review of Educational Research, 86(1):42–78, 2016.

  47. [47] J. Lin, E. Chen, Z. Han, A. Gurung, D. R. Thomas, W. Tan, N. D. Nguyen, and K. R. Koedinger. How can I improve? Using GPT to highlight the desired and undesired parts of open-ended responses. arXiv preprint arXiv:2405.00291, 2024.

  48. [48] J. Lin, Z. Han, D. R. Thomas, A. Gurung, S. Gupta, V. Aleven, and K. R. Koedinger. How can I get it right? Using GPT to rephrase incorrect trainee responses. International Journal of Artificial Intelligence in Education, 35(2):482–508, 2025.

  49. [49] J. McCrary. Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics, 142(2):698–714, 2008.

  50. [50] D. S. McNamara, I. B. Levinstein, and C. Boonthum. iSTART: Interactive strategy training for active reading and thinking. Behavior Research Methods, Instruments, & Computers, 36(2):222–233, 2004.

  51. [51] A. Mitrovic. An intelligent SQL tutor on the web. International Journal of Artificial Intelligence in Education, 13(2-4):173–197, 2003.

  52. [52] J. Morrow and M. Ackermann. Intention to persist and retention of first-year students: The importance of motivation and sense of belonging. College Student Journal, 46(3):483–491, 2012.

  53. [53] K. Muldner, R. Lam, and M. T. Chi. Comparing learning from observing and from human tutoring. Journal of Educational Psychology, 106(1):69, 2014.

  54. [54] A. Nickow, P. Oreopoulos, and V. Quan. The impressive effects of tutoring on preK-12 learning: A systematic review and meta-analysis of the experimental evidence. 2020.

  55. [55] P. O'Keeffe. A sense of belonging: Improving student retention. College Student Journal, 47(4):605–613, 2013.

  56. [56] P. Picchetti, C. C. Pinto, and S. T. Shinoki. Difference-in-discontinuities: Estimation, inference and validity tests. arXiv preprint arXiv:2405.18531, 2024.

  57. [57] T. W. Price, Y. Dong, and D. Lipovac. iSnap: Towards intelligent tutoring in novice programming environments. In Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education, pages 483–488, 2017.

  58. [58] D. D. Ready, S. G. McCormick, and R. J. Shmoys. The effects of in-school virtual tutoring on student reading development: Evidence from a short-cycle randomized controlled trial. Journal of Education for Students Placed at Risk (JESPAR), pages 1–21, 2026.

  59. [59] C. D. Robinson, C. Pollard, S. Novicoff, S. White, and S. Loeb. The effects of virtual tutoring on young readers: Results from a randomized controlled trial. Educational Evaluation and Policy Analysis, 47(4):1245–1265, 2025.

  60. [60] R. D. Roscoe and D. S. McNamara. Writing Pal: Feasibility of an intelligent writing strategy tutor in the high school classroom. Journal of Educational Psychology, 105(4):1010, 2013.

  61. [61] C. Schonberg. The impact of IXL on math and ELA learning in Pennsylvania. Technical report, 2025. Retrieved July 6, 2025.

  62. [62] V. Shute, S. Lajoie, and K. Gluck. Individualized and group approaches to training. Training and Retraining: A Handbook for Business, Industry, Government, and the Military, pages 171–207, 2000.

  63. [63] V. J. Shute, G. Smith, R. Kuba, C.-P. Dai, S. Rahimi, Z. Liu, and R. Almond. The design, development, and testing of learning supports for the Physics Playground game. International Journal of Artificial Intelligence in Education, 31(3):357–379, 2021.

  64. [64] S. Steenbergen-Hu and H. Cooper. A meta-analysis of the effectiveness of intelligent tutoring systems on K-12 students' mathematical learning. Journal of Educational Psychology, 105(4):970, 2013.

  65. [65] D. R. Thomas, J. Lin, E. Gatz, A. Gurung, S. Gupta, K. Norberg, S. E. Fancsali, V. Aleven, L. Branstetter, E. Brunskill, et al. Improving student learning with hybrid human-AI tutoring: A three-study quasi-experimental investigation. In Proceedings of the 14th Learning Analytics and Knowledge Conference, pages 404–415, 2024.

  66. [66] K. VanLehn. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4):197–221, 2011.

  67. [67] M. Waalkens, V. Aleven, and N. Taatgen. Does supporting multiple student strategies lead to greater learning and motivation? Investigating a source of complexity in the architecture of intelligent tutoring systems. Computers & Education, 60(1):159–171, 2013.

  68. [68] K. B. Yang, V. Echeverria, Z. Lu, H. Mao, K. Holstein, N. Rummel, and V. Aleven. Pair-Up: Prototyping human-AI co-orchestration of dynamic transitions between individual and collaborative learning in the classroom. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–17, 2023.

  69. [69] K. B. Yang, V. Echeverria, X. Wang, L. Lawrence, K. Holstein, N. Rummel, and V. Aleven. Exploring policies for dynamically teaming up students through log data simulation. International Educational Data Mining Society, 2021.