Recognition: no theorem link
Improving Hybrid Human-AI Tutoring by Differentiating Human Tutor Roles Based on Student Needs
Pith reviewed 2026-05-13 02:47 UTC · model grok-4.3
The pith
Differentiated proactive and reactive human tutor roles in hybrid human-AI systems raise time on task by 25%, skill proficiency by 36%, and academic growth by 61% over AI-only tutoring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a difference-in-discontinuity design, the authors show that assigning proactive human tutoring to students below the median state test score and reactive support to those at or above it yields substantial improvements over AI-only tutoring: a 25% increase in time on task, a 36% increase in skill proficiency, and a 61% increase in academic growth as measured by the MAP test. While the two approaches yield similar gains in time on task and proficiency, proactive tutoring shows marginally higher MAP growth (75% higher, p = .065), with larger benefits for students farther below the cutoff, thereby helping to narrow achievement gaps.
What carries the argument
Difference-in-discontinuity (DiDC) design that compares the discontinuity in outcomes around the within-grade median state test score cutoff between the AI-only fall period and the differentiated hybrid spring period.
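A minimal sketch of how such a DiDC effect can be estimated, assuming a long-format table with one row per student-period and hypothetical columns score (running variable centered at the within-grade median), post (fall = 0, spring = 1), and outcome; none of these names come from the paper's repository. A local linear model with side-of-cutoff and period interactions recovers the policy effect as the below:post coefficient:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical schema (not from the paper's code):
#   score   - state test score centered at the within-grade median cutoff
#   post    - 0 for fall (AI-only), 1 for spring (differentiated hybrid)
#   outcome - e.g., time on task, skill proficiency, or MAP growth
def didc_effect(df: pd.DataFrame, bandwidth: float) -> float:
    """Local linear difference-in-discontinuity at score = 0."""
    local = df.loc[df["score"].abs() <= bandwidth].copy()
    local["below"] = (local["score"] < 0).astype(int)  # proactive side
    # Fit separate slopes on each side of the cutoff in each period; the
    # below:post coefficient is the change in the cutoff discontinuity
    # from fall to spring, i.e., the DiDC estimate.
    fit = smf.ols("outcome ~ below * score * post", data=local).fit(cov_type="HC1")
    return fit.params["below:post"]
```

The fall terms absorb any pre-existing discontinuity at the median, so the estimate isolates what changed when the differentiated policy was introduced in spring.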
If this is right
- Hybrid human-AI tutoring with differentiated roles outperforms AI-only tutoring across time on task, skill proficiency, and standardized growth.
- Proactive and reactive human support produce comparable gains in engagement and proficiency for their respective groups.
- Proactive tutoring provides additional MAP growth benefits that help narrow achievement gaps for students below the median.
- The differentiated policy offers a cost-effective model for scaling hybrid tutoring by matching human effort to student need.
Where Pith is reading between the lines
- This cutoff-based assignment could be adapted using local or real-time performance data instead of annual state tests.
- Dynamic switching between proactive and reactive modes might further optimize outcomes if performance changes mid-semester (a minimal policy sketch follows this list).
- The approach raises the possibility of applying similar differentiation in other AI learning tools, such as adaptive homework platforms.
- Tracking whether the gap-narrowing effect persists across multiple years would test longer-term equity gains.
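To make the dynamic-switching idea above concrete, here is a hedged sketch of a threshold policy with a hysteresis band, so students near the median do not flip modes every week; the rolling signal, margin, and mode names are illustrative assumptions, not anything the paper implements:

```python
from dataclasses import dataclass

@dataclass
class DynamicTutorPolicy:
    median: float          # within-grade median of the rolling performance signal
    margin: float = 0.05   # hysteresis band to avoid rapid mode flapping

    def assign(self, rolling_score: float, current_mode: str) -> str:
        """Return 'proactive' or 'reactive' given the latest rolling score."""
        if rolling_score < self.median - self.margin:
            return "proactive"   # tutor-initiated support
        if rolling_score >= self.median + self.margin:
            return "reactive"    # on-demand support
        return current_mode      # inside the band: keep the current mode
```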
Load-bearing premise
The within-grade median state test score cutoff cleanly separates students who benefit differently from proactive versus reactive human support, and the DiDC comparison isolates the policy effect without unmeasured time-varying confounders.
What would settle it
Observing no larger discontinuity in spring outcomes around the median cutoff than existed in fall, or finding that the extra MAP growth for proactive tutoring disappears after controlling for seasonal or other factors, would indicate the differentiated policy does not deliver the claimed benefits.
Original abstract
Hybrid human-AI tutoring, where technology and humans jointly facilitate student learning, can be more beneficial than AI-only tutoring. However, preliminary evidence suggests that lower-performing students derive greater benefit from human-AI tutoring than higher-performing students. As such, this study evaluates whether a differentiated tutoring policy can effectively support both groups: human tutors initiate support for lower-performing students, while higher-performing students receive reactive, on-demand support. Using their within-grade median state test scores, we assigned 635 students (grades 5-8) to receive proactive (< median) or reactive (≥ median) tutoring. Using a DiDC design, we compare outcomes across two time periods: fall (AI-only tutoring) and spring (proactive-reactive human-AI tutoring). This quasi-experimental design isolates the effects of proactive-reactive tutoring approaches by comparing the discontinuity in spring outcomes to the fall, where no such discontinuity existed. Using data around the cutoff (Imbens-Kalyanaraman criterion), we find significant overall improvements from human-AI tutoring compared to AI-only baseline: 25% increase in time on task, 36% in skill proficiency, and 61% in academic growth (standardized MAP test). Between proactive and reactive tutoring, we find comparable improvements in time-on-task and skill proficiency. However, proactive tutoring, on average, showed marginally higher MAP growth (75%, p = .065) than reactive tutoring, i.e., proactive tutoring was more beneficial to students farther below the cutoff and helped narrow achievement gaps. Our findings provide evidence that differentiated human-AI tutoring addresses the needs of both groups, offering a practical and cost-effective strategy for scaling hybrid instruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates a differentiated hybrid human-AI tutoring policy for 635 students in grades 5-8. Using within-grade median state test scores, students below the median receive proactive human support while those at or above receive reactive support. A difference-in-discontinuity (DiDC) design compares fall (AI-only) to spring (hybrid differentiated) outcomes around the cutoff, with the analysis bandwidth selected via the Imbens-Kalyanaraman criterion. The study reports overall gains versus the AI-only baseline of 25% in time on task, 36% in skill proficiency, and 61% in standardized MAP growth. Proactive and reactive approaches yield comparable gains in time on task and proficiency, but proactive tutoring shows a marginally higher MAP growth advantage (75%, p = .065), interpreted as narrowing achievement gaps for lower-performing students.
Significance. If the DiDC identification holds, the work provides practical evidence that tailoring human tutor roles in hybrid systems (proactive for lower performers, reactive for higher performers) can scale hybrid instruction while improving equity and outcomes. The quasi-experimental approach with explicit bandwidth selection is a methodological strength that supports causal claims about policy differentiation. This contributes to the human-AI tutoring literature by demonstrating a cost-effective strategy that addresses heterogeneous student needs.
major comments (2)
- The DiDC design attributes spring outcome discontinuities at the median cutoff to the proactive/reactive policy, but this requires that fall-to-spring changes would have been continuous at the cutoff absent the intervention. The manuscript reports the marginal p=.065 result for the proactive MAP advantage without detailing robustness checks (e.g., covariate discontinuity tests, placebo cutoffs, or sensitivity to bandwidth choice) that would support the no time-varying confounder assumption. (DiDC analysis and results)
- The exact sample size of observations within the Imbens-Kalyanaraman bandwidth used for the local DiDC estimation is not reported. This information is load-bearing for assessing statistical power, precision of the 25%/36%/61% gains, and reliability of the p=.065 differential effect. (Results reporting)
minor comments (1)
- Clarify the precise interpretation of the '75%' MAP growth figure for proactive versus reactive tutoring (e.g., relative percentage increase, standardized effect size, or model coefficient). (Abstract and results)
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important areas for strengthening the causal claims and transparency of our DiDC analysis. We address each major comment below and will incorporate revisions to improve the manuscript.
Point-by-point responses
-
Referee: The DiDC design attributes spring outcome discontinuities at the median cutoff to the proactive/reactive policy, but this requires that fall-to-spring changes would have been continuous at the cutoff absent the intervention. The manuscript reports the marginal p=.065 result for the proactive MAP advantage without detailing robustness checks (e.g., covariate discontinuity tests, placebo cutoffs, or sensitivity to bandwidth choice) that would support the no time-varying confounder assumption. (DiDC analysis and results)
Authors: We agree that explicit robustness checks are needed to bolster confidence in the identifying assumption of no time-varying confounders at the cutoff. In the revised manuscript, we will add: (1) covariate discontinuity tests for baseline characteristics (e.g., prior achievement, demographics) to verify balance; (2) placebo tests using alternative cutoffs (such as other within-grade quantiles) where no policy change occurred; and (3) sensitivity analyses to bandwidth choice around the Imbens-Kalyanaraman selection. These will be reported in a new subsection of the results to support the DiDC interpretation, including the marginal proactive MAP growth advantage. revision: yes
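A minimal sketch of the proposed placebo test, reusing the hypothetical didc_effect() from the earlier sketch; the quantiles, bandwidth handling, and raw_score column are illustrative assumptions rather than the authors' implementation:

```python
import pandas as pd

# Re-center the running variable at cutoffs where no policy change
# occurred and re-estimate the DiDC effect. Estimates near zero at the
# placebo cutoffs support the no-time-varying-confounder assumption.
def placebo_didc(df: pd.DataFrame, bandwidth: float,
                 quantiles=(0.25, 0.40, 0.60, 0.75)) -> dict:
    effects = {}
    for q in quantiles:
        shifted = df.copy()
        shifted["score"] = df["raw_score"] - df["raw_score"].quantile(q)
        effects[q] = didc_effect(shifted, bandwidth)  # from earlier sketch
    return effects
```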
-
Referee: The exact sample size of observations within the Imbens-Kalyanaraman bandwidth used for the local DiDC estimation is not reported. This information is load-bearing for assessing statistical power, precision of the 25%/36%/61% gains, and reliability of the p=.065 differential effect. (Results reporting)
Authors: We concur that the precise number of observations within the selected bandwidth is critical for evaluating the estimates. We will report the exact sample sizes (overall and by outcome) used in the local DiDC estimation in the revised results section, along with the corresponding effective sample sizes after applying the Imbens-Kalyanaraman criterion. revision: yes
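A small sketch of the reporting the referee asks for (effective sample sizes within candidate bandwidths around the cutoff), using the same hypothetical schema as the earlier sketches; the bandwidth values are placeholders, not the paper's Imbens-Kalyanaraman selections:

```python
import pandas as pd

def effective_n(df: pd.DataFrame, bandwidths=(5.0, 10.0, 15.0)) -> pd.DataFrame:
    """Count observations within each bandwidth, split by side of the cutoff."""
    rows = []
    for h in bandwidths:
        local = df.loc[df["score"].abs() <= h]
        rows.append({
            "bandwidth": h,
            "n_total": len(local),
            "n_below": int((local["score"] < 0).sum()),        # proactive side
            "n_at_or_above": int((local["score"] >= 0).sum()), # reactive side
        })
    return pd.DataFrame(rows)
```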
Circularity Check
No circularity: purely empirical DiDC reporting with no derivation chain
full rationale
The paper presents measured outcomes from a quasi-experimental difference-in-discontinuity (DiDC) design comparing fall AI-only tutoring to spring differentiated hybrid tutoring, using within-grade median state test scores as the cutoff. No equations, fitted parameters, predictions, or self-citations are invoked to derive the central claims; the reported improvements (time on task, skill proficiency, MAP growth) are direct empirical results. The analysis is grounded in external data benchmarks and does not, by construction, reduce any result to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The DiDC design isolates the causal effect of the proactive-reactive policy by comparing the spring discontinuity to the fall baseline, where no policy discontinuity existed.
Reference graph
Works this paper leans on
-
[1]
Improving Hybrid Human-AI Tutoring by Differentiating Human Tutor Roles Based on Student Needs
INTRODUCTION The efficacy of human tutoring is well established in educational research [43], with meta-analyses showing it to be particularly beneficial for lower-performing and socioeconomically disadvantaged students [19, 44]. However, scaling human tutoring remains a persistent challenge due to cost and logistical constraints. Automated altern...
work page 2026
-
[2]
When calculating the volume of a cylinder, do we use the area or the perimeter of the base?
BACKGROUND This section reviews the current evidence on the effectiveness of human tutoring, AI tutors, and human-AI tutoring. We also examine how researchers and practitioners have sought to scale these interventions. 2.1 Human Tutoring Bloom’s seminal 2-sigma effect [9] highlighted the impressive learning benefits of one-on-one human tutoring (under ide...
-
[3]
PRESENT STUDY Building on evidence that lower-performing students benefit more from additional human tutor support (Section 2.3.2), we propose a proactive-reactive tutoring policy that matches the intensity of human support to student needs. Human tutors initiate additional motivational and conceptual support to lower-performing students and reactive, o...
-
[4]
METHOD We use a DiDC design to estimate the effects of introducing human-AI tutoring (RQ1; DiD) and the causal effects of proactive versus reactive tutoring (RQ2; estimated at the median cutoff). 4.1 Dataset The dataset includes assignment logs from IXL (AI tutor) and standardized test scores for 635 students in grades 5–8 at a middle school in a Mid-Atla...
work page 2024
-
[5]
RESULTS The code used for the analysis is available in the anonymous GitHub repository. 5.1 Descriptive Statistics The descriptive statistics for student performance are reported in Table 2. These include prior year’s state test scores and the winter and spring MAP results. Table 2: The mean and SD of student performance on the state and MAP tests. Gr...
-
[6]
The gray line represents the students’ relative MAP growth (1.3×) during the AI-only baseline period
DISCUSSION Given the heterogeneous treatment effects of human-AI tutoring, we study a differentiated proactive–reactive tutoring policy in which lower-performing students receive tutor-initiated support, whereas higher-performing students re... (footnote: http://tiny.cc/EDM26) Figure 6: Examining the impact of proactive and reactive tutoring on students’ relative...
-
[7]
LIMITATIONS Our quasi-experimental DiDC design [46] provides evidence for the benefits of personalizing tutoring support, though stronger causal evidence could be achieved from randomized controlled trials. Similarly, our reliance on national growth norms as a reference point, rather than a matched control group, limits causal inference. However, ethical...
-
[8]
FUTURE WORK The absence of Zoom interaction data (actual student–tutor interactions) represents a missed opportunity for deeper analysis, especially for understanding which elements of proactive tutoring were effective for student learning. We primarily relied on lead tutors and tutor supervisors to ensure implementation fidelity; better Zoom inte...
-
[9]
CONCLUSION This study explored a key challenge in scaling human-AI tutoring: how to allocate limited human tutor support to effectively meet diverse student learning needs. Motivated by preliminary evidence that lower-performing students derive greater benefits from human-AI tutoring than higher-performing peers, we evaluated a differentiated proactive...
-
[10]
ADDITIONAL AUTHORS Additional authors: Alex Houk (ahouk@andrew.cmu.edu), Erin Gatz (egatz@andrew.cmu.edu), and Boyuan (Bill) Guo (boyuang@cmu.edu)
-
[11]
V. Aleven, R. Baraniuk, E. Brunskill, S. Crossley, D. Demszky, S. Fancsali, S. Gupta, K. Koedinger, C. Piech, S. Ritter, et al. Towards the future of ai-augmented human tutoring in math learning. In International Conference on Artificial Intelligence in Education, pages 26–31. Springer, 2023
work page 2023
- [12]
- [13]
-
[14]
P. An, K. Holstein, B. d’Anjou, B. Eggen, and S. Bakker. The ta framework: Designing real-time teaching augmentation for k-12 classrooms. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–17, 2020
work page 2020
-
[15]
J. R. Anderson, A. T. Corbett, K. R. Koedinger, and R. Pelletier. Cognitive tutors: Lessons learned. The journal of the learning sciences, 4(2):167–207, 1995
work page 1995
-
[16]
R. S. Baker. Modeling and understanding students’ off-task behavior in intelligent tutoring systems. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 1059–1068, 2007
work page 2007
-
[17]
M. P. Bhatt, J. Guryan, S. A. Khan, M. LaForest-Tucker, and B. Mishra. Can technology facilitate scale? evidence from a randomized evaluation of high dosage tutoring. Technical report, National Bureau of Economic Research, 2024
work page 2024
-
[18]
J. Bleiberg, C. D. Robinson, E. Bennett, and S. Loeb. The impact of tutor gender match on girls’ stem interest, engagement, and performance. 2025
work page 2025
-
[19]
B. S. Bloom. The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational researcher, 13(6):4–16, 1984
work page 1984
-
[20]
C. Borchers, A. Gurung, Q. Liu, D. R. Thomas, M. Khalil, and K. R. Koedinger. Brief but impactful: How human tutoring interactions shape engagement in online learning. arXiv preprint arXiv:2601.09994, 2026
-
[21]
C. Borchers, A. Houk, V. Aleven, and K. R. Koedinger. Engagement and learning benefits of goal setting with rewards in human-ai tutoring. In International Conference on Artificial Intelligence in Education, pages 46–59. Springer, 2025
work page 2025
-
[22]
S. Calonico, M. D. Cattaneo, and R. Titiunik. Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6):2295–2326, 2014
work page 2014
-
[23]
M. T. Chi, S. A. Siler, H. Jeong, T. Yamauchi, and R. G. Hausmann. Learning from human tutoring. Cognitive science, 25(4):471–533, 2001
work page 2001
-
[24]
D. R. Chine, C. Brentley, C. Thomas-Browne, J. E. Richey, A. Gul, P. F. Carvalho, L. Branstetter, and K. R. Koedinger. Educational equity through combined human-ai personalization: A propensity matching evaluation. In International conference on artificial intelligence in education, pages 366–377. Springer, 2022
work page 2022
-
[25]
S. Copeland, M. A. Cook, A. A. Grant, and S. M. Ross. Randomized-control efficacy study of ixl math in holland public schools. Center for Research and Reform in Education, 2023
work page 2023
-
[26]
A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User modeling and user-adapted interaction, 4(4):253–278, 1994
work page 1994
-
[27]
M. G. Core, J. D. Moore, and C. Zinn. The role of initiative in tutorial dialogue. In 10th Conference of the European Chapter of the Association for Computational Linguistics, 2003
work page 2003
- [28]
-
[29]
J. Dietrichson, M. Bøg, T. Filges, and A.-M. Klint Jørgensen. Academic interventions for elementary and middle school students with low socioeconomic status: A systematic review and meta-analysis. Review of educational research, 87(2):243–282, 2017
work page 2017
- [30]
-
[31]
Y. Fang, Z. Ren, X. Hu, and A. C. Graesser. A meta-analysis of the effectiveness of aleks on learning. Educational Psychology, 39(10):1278–1292, 2019
work page 2019
-
[32]
K. Forbes-Riley and D. J. Litman. Analyzing dependencies between student certainness states and tutor responses in a spoken dialogue corpus. In Recent Trends in Discourse and Dialogue, pages 275–304. Springer, 2008
work page 2008
-
[33]
B. A. Fox. The Human Tutorial Dialogue Project: Issues in the design of instructional systems. CRC Press, 2020
work page 2020
-
[34]
A. Goodman-Bacon. Difference-in-differences with variation in treatment timing. Journal of econometrics, 225(2):254–277, 2021
work page 2021
-
[35]
A. A. Grant, M. A. Cook, and S. M. Ross. The impacts of i-ready personalized instruction on student math achievement. Center for Research and Reform in Education, 2023
work page 2023
-
[36]
A. Gurung, J. Lin, J. Gutterman, D. R. Thomas, A. Houk, S. Gupta, E. Brunskill, L. Branstetter, V. Aleven, and K. Koedinger. Human tutoring improves the impact of ai tutor use on learning outcomes. In International Conference on Artificial Intelligence in Education, pages 393–407. Springer, 2025
work page 2025
- [37]
- [38]
- [39]
-
[40]
N. T. Heffernan and C. L. Heffernan. The assistments ecosystem: Building a platform that brings scientists and teachers together for minimally invasive research on human learning and teaching. International Journal of Artificial Intelligence in Education, 24(4):470–497, 2014
work page 2014
-
[41]
K. Holstein, B. M. McLaren, and V. Aleven. Student learning benefits of a mixed-reality teacher awareness tool in ai-enhanced classrooms. In International conference on artificial intelligence in education, pages 154–168. Springer, 2018
work page 2018
-
[42]
G. Imbens and K. Kalyanaraman. Optimal bandwidth choice for the regression discontinuity estimator. The Review of economic studies, 79(3):933–959, 2012
work page 2012
-
[43]
G. W. Imbens and T. Lemieux. Regression discontinuity designs: A guide to practice. Journal of econometrics, 142(2):615–635, 2008
work page 2008
-
[44]
Measuring the impact of ixl math and ixl language arts in pennsylvania schools
IXL Learning. Measuring the impact of ixl math and ixl language arts in pennsylvania schools. Technical report, IXL Learning, 2020. Retrieved July 6, 2025
work page 2020
-
[45]
S. Katz, P. Albacete, I.-A. Chounta, P. Jordan, B. M. McLaren, and D. Zapata-Rivera. Linking dialogue with student modelling to create an adaptive tutoring system for conceptual physics. International journal of artificial intelligence in education, 31(3):397–445, 2021
work page 2021
-
[46]
J. A. Kulik and J. D. Fletcher. Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of educational research, 86(1):42–78, 2016
work page 2016
- [47]
-
[48]
J. Lin, Z. Han, D. R. Thomas, A. Gurung, S. Gupta, V. Aleven, and K. R. Koedinger. How can i get it right? using gpt to rephrase incorrect trainee responses. International journal of artificial intelligence in education, 35(2):482–508, 2025
work page 2025
-
[49]
J. McCrary. Manipulation of the running variable in the regression discontinuity design: A density test. Journal of econometrics, 142(2):698–714, 2008
work page 2008
-
[50]
D. S. McNamara, I. B. Levinstein, and C. Boonthum. istart: Interactive strategy training for active reading and thinking. Behavior Research Methods, Instruments, & Computers, 36(2):222–233, 2004
work page 2004
- [51]
-
[52]
J. Morrow and M. Ackermann. Intention to persist and retention of first-year students: The importance of motivation and sense of belonging. College student journal, 46(3):483–491, 2012
work page 2012
-
[53]
K. Muldner, R. Lam, and M. T. Chi. Comparing learning from observing and from human tutoring. Journal of Educational Psychology, 106(1):69, 2014
work page 2014
- [54]
- [55]
-
[56]
P. Picchetti, C. C. Pinto, and S. T. Shinoki. Difference-in-discontinuities: estimation, inference and validity tests. arXiv preprint arXiv:2405.18531, 2024
-
[57]
T. W. Price, Y. Dong, and D. Lipovac. isnap: towards intelligent tutoring in novice programming environments. In Proceedings of the 2017 ACM SIGCSE Technical Symposium on computer science education, pages 483–488, 2017
work page 2017
-
[58]
D. D. Ready, S. G. McCormick, and R. J. Shmoys. The effects of in-school virtual tutoring on student reading development: Evidence from a short-cycle randomized controlled trial. Journal of Education for Students Placed at Risk (JESPAR), pages 1–21, 2026
work page 2026
-
[59]
C. D. Robinson, C. Pollard, S. Novicoff, S. White, and S. Loeb. The effects of virtual tutoring on young readers: Results from a randomized controlled trial. Educational Evaluation and Policy Analysis, 47(4):1245–1265, 2025
work page 2025
-
[60]
R. D. Roscoe and D. S. McNamara. Writing pal: Feasibility of an intelligent writing strategy tutor in the high school classroom. Journal of Educational Psychology, 105(4):1010, 2013
work page 2013
- [61]
- [62]
-
[63]
V. J. Shute, G. Smith, R. Kuba, C.-P. Dai, S. Rahimi, Z. Liu, and R. Almond. The design, development, and testing of learning supports for the physics playground game. International Journal of Artificial Intelligence in Education, 31(3):357–379, 2021
work page 2021
-
[64]
S. Steenbergen-Hu and H. Cooper. A meta-analysis of the effectiveness of intelligent tutoring systems on k–12 students’ mathematical learning. Journal of educational psychology, 105(4):970, 2013
work page 2013
-
[65]
D. R. Thomas, J. Lin, E. Gatz, A. Gurung, S. Gupta, K. Norberg, S. E. Fancsali, V. Aleven, L. Branstetter, E. Brunskill, et al. Improving student learning with hybrid human-ai tutoring: A three-study quasi-experimental investigation. In Proceedings of the 14th Learning Analytics and Knowledge Conference, pages 404–415, 2024
work page 2024
-
[66]
K. VanLehn. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational psychologist, 46(4):197–221, 2011
work page 2011
-
[67]
M. Waalkens, V. Aleven, and N. Taatgen. Does supporting multiple student strategies lead to greater learning and motivation? investigating a source of complexity in the architecture of intelligent tutoring systems. Computers & Education, 60(1):159–171, 2013
work page 2013
-
[68]
K. B. Yang, V. Echeverria, Z. Lu, H. Mao, K. Holstein, N. Rummel, and V. Aleven. Pair-up: prototyping human-ai co-orchestration of dynamic transitions between individual and collaborative learning in the classroom. In Proceedings of the 2023 CHI conference on human factors in computing systems, pages 1–17, 2023
work page 2023
-
[69]
K. B. Yang, V. Echeverria, X. Wang, L. Lawrence, K. Holstein, N. Rummel, and V. Aleven. Exploring policies for dynamically teaming up students through log data simulation. International Educational Data Mining Society, 2021
work page 2021
discussion (0)