Recognition: unknown
The Missing Evaluation Axis: What 10,000 Student Submissions Reveal About AI Tutor Effectiveness
Pith reviewed 2026-05-08 05:12 UTC · model grok-4.3
The pith
AI tutor evaluations must track whether students actually apply the feedback, as behavioral signals from code changes reveal more about effectiveness than pedagogy scores alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that extending AI tutor evaluation with a behavioral dimension (measuring whether students modify their code after receiving feedback and whether those modifications correctly implement the tutor's advice) reveals substantial differences between tutors that pedagogy scores alone miss. Applied to real course data, the combined approach also shows that engagement signals track student perceptions of helpfulness more closely than pedagogical quality ratings do, yielding a more complete view of tutor performance than quality ratings in isolation.
What carries the argument
The proposed evaluation framework that analyzes pre- and post-feedback code submissions to quantify student engagement and correct application of tutor advice.
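To make the behavioral axis concrete, here is a minimal sketch of what such a framework could compute per interaction, assuming access to the submission immediately before feedback, the next submission after it, and the constructs the tutor asked for. The function names, the SequenceMatcher-based distance, and the 0.05 threshold are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of the behavioral signals: did the student change the
# code after feedback, and does the revised code reflect the advice?
from difflib import SequenceMatcher

def normalized_edit_distance(before: str, after: str) -> float:
    # 0.0 = identical submissions, 1.0 = completely rewritten.
    if not before and not after:
        return 0.0
    return 1.0 - SequenceMatcher(None, before, after).ratio()

def acted_on_feedback(before: str, after: str, threshold: float = 0.05) -> bool:
    # Engagement flag: any post-feedback change larger than the threshold.
    return normalized_edit_distance(before, after) > threshold

def applied_correctly(after: str, required_snippets: list[str]) -> bool:
    # Crude correctness proxy: every construct the feedback asked for now appears.
    return all(snippet in after for snippet in required_snippets)
```

The author response below describes correctness as alignment with the specific suggestions in the tutor message; the string-containment test here only stands in for that step.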
If this is right
- Comparisons of AI tutors must incorporate both pedagogical and engagement metrics to avoid incomplete conclusions.
- Student ratings of feedback helpfulness track more closely with observed code changes than with expert pedagogy scores (a minimal correlation sketch follows this list).
- Deployed tutors can be differentiated by how well they prompt visible student action on suggestions.
- Large interaction datasets enable identification of feedback that students ignore despite high pedagogical quality.
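The second bullet above, helpfulness ratings tracking code changes more closely than pedagogy scores, is at heart a comparison of two correlations. Below is a minimal sketch under assumed column names; the paper's actual coefficients, controls, and choice of test are not reported in this summary.

```python
# Hedged sketch: correlate student helpfulness ratings with the behavioral
# engagement flag and with expert pedagogy scores. Column names and the use
# of Spearman correlation are assumptions for illustration.
import pandas as pd
from scipy.stats import spearmanr

def compare_signals(df: pd.DataFrame) -> dict:
    # df columns: 'helpful' (student rating), 'engaged' (0/1 behavioral flag),
    # 'pedagogy' (expert quality score).
    r_beh, p_beh = spearmanr(df["helpful"], df["engaged"])
    r_ped, p_ped = spearmanr(df["helpful"], df["pedagogy"])
    return {"behavioral": {"rho": r_beh, "p": p_beh},
            "pedagogy": {"rho": r_ped, "p": p_ped}}
```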
Where Pith is reading between the lines
- Course designers could prioritize tutors that maximize observable engagement when choosing tools for large classes.
- Similar behavioral tracking might apply to other AI-assisted learning domains where user actions after advice can be logged.
- Optimizing tutors for both accurate advice and prompts that increase correct application could improve overall outcomes.
Load-bearing premise
That observed code changes after feedback mean students understood and correctly applied the advice rather than making unrelated edits or receiving help elsewhere.
What would settle it
If a large share of post-feedback code changes turn out to be incorrect, random, or uncorrelated with the specific advice given, the claim that behavioral signals reliably indicate effective tutor use would be undermined.
Original abstract
Current Artificial Intelligence (AI)-based tutoring systems (AI tutors) are primarily evaluated based on the pedagogical quality of their feedback messages. While important, pedagogy alone is insufficient because it ignores a critical question: what do students actually do with the feedback they receive? We argue that AI tutor evaluation should be extended with a behavioral dimension grounded in student interaction data, which complements pedagogical assessment. We propose an evaluation framework and apply it to 10,235 code submissions with corresponding AI tutor feedback from an introductory undergraduate programming course to measure whether students act on tutor feedback and whether those actions are applied correctly. Using this framework to compare two deployed AI tutors across different semesters in a large-scale introductory computer science course reveals substantial differences in student engagement patterns that are not captured by pedagogy-only evaluation. Moreover, these engagement-based behavioral signals are more strongly associated with student perception of helpful feedback than pedagogical quality alone, providing a more complete and actionable picture of AI tutor performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard pedagogy-only evaluation of AI tutors is insufficient and should be extended with a behavioral axis measuring whether students act on feedback (via post-feedback code modifications) and apply it correctly. Analysis of 10,235 submissions from an introductory CS course shows substantial engagement differences between two deployed tutors not visible in pedagogy scores; moreover, these behavioral signals correlate more strongly with student-reported helpfulness than pedagogy alone.
Significance. If the behavioral metric holds after validation, the work supplies a practical, data-grounded complement to existing evaluation that directly links tutor output to observable student uptake, offering clearer guidance for iterative improvement of deployed AI tutors.
major comments (1)
- [Framework / Methods] (as described in the abstract and skeptic note): The central inference that detected code changes after feedback reliably indicate that students have correctly applied the tutor's advice is load-bearing for both the tutor-comparison result and the stronger-correlation claim, yet the manuscript provides no inter-rater reliability, ablation on edit-distance thresholds, timing controls, or manual ground-truth validation to rule out independent debugging, external help, or unrelated edits. Without these, the behavioral signals risk capturing noise rather than genuine uptake.
minor comments (2)
- [Abstract] Exact correlation coefficients, p-values, and the precise statistical controls used for the perception association are not stated, preventing readers from judging effect size and robustness.
- [Methods] The manuscript should clarify the exact operational definition of 'correctly implement the advice' (e.g., required edit distance, semantic matching rules) and any data-exclusion criteria applied to the 10,235 submissions.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for highlighting the need for stronger validation of the behavioral signals. We address the concern directly below.
Point-by-point responses
Referee: The central inference that detected code changes after feedback reliably indicate students have correctly applied the tutor's advice is load-bearing for both the tutor-comparison result and the stronger-correlation claim, yet the manuscript provides no inter-rater reliability, ablation on edit-distance thresholds, timing controls, or manual ground-truth validation to rule out independent debugging, external help, or unrelated edits. Without these, the behavioral signals risk capturing noise rather than genuine uptake.
Authors: We agree that the absence of explicit validation leaves the behavioral metric open to the alternative explanations noted. The current framework detects post-feedback edits via normalized edit distance and classifies application correctness by alignment with the specific suggestions in the tutor message (e.g., insertion of a required construct or correction of an identified error). To strengthen this, the revised manuscript will include: (1) manual ground-truth labeling of a stratified random sample of 200 submissions by two independent raters, with reported Cohen's kappa; (2) an ablation table varying the edit-distance threshold and showing stability of the tutor-comparison and correlation results; and (3) a timing analysis that restricts the behavioral signal to edits occurring within a short window after feedback and compares it to longer windows to assess contamination by independent debugging. While the observed correlation with student-reported helpfulness already offers convergent validity, these additions will directly address the risk of noise.
Revision: yes
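A hedged sketch of the three validation steps promised in the response above, under assumed data structures; thresholds, window sizes, and field names are placeholders rather than the manuscript's actual choices.

```python
# Illustrative only: threshold ablation, timing-window restriction, and
# inter-rater agreement for the planned ground-truth validation.
from sklearn.metrics import cohen_kappa_score

def threshold_ablation(distances, thresholds=(0.01, 0.05, 0.10, 0.20)):
    # Share of interactions flagged as "engaged" at each edit-distance cutoff.
    return {t: sum(d > t for d in distances) / len(distances) for t in thresholds}

def within_window(feedback_ts, edit_ts, minutes=30):
    # Keep only edits occurring shortly after feedback delivery (datetime inputs).
    return 0 <= (edit_ts - feedback_ts).total_seconds() <= minutes * 60

def rater_agreement(labels_a, labels_b):
    # Cohen's kappa on the stratified sample labeled by two independent raters.
    return cohen_kappa_score(labels_a, labels_b)
```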
Circularity Check
No significant circularity detected
Full rationale
The paper's evaluation framework is constructed from direct empirical measurements on 10,235 code submissions: behavioral signals are defined via observable post-feedback code modifications, then correlated against independent student perception surveys and pedagogy scores. No step reduces a claimed result to its own inputs by construction, renames a fitted parameter as a prediction, or relies on load-bearing self-citations whose content is unverified. The association between behavioral signals and perceived helpfulness is a data-driven finding rather than a definitional tautology. The derivation chain remains self-contained against the provided submission logs and survey responses.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Code submission changes following tutor feedback indicate that students have acted on the feedback.
- domain assumption: Whether those changes are correct can be determined from the resulting code state.