Using Learning Progressions to Guide AI Feedback for Science Learning
Pith reviewed 2026-05-15 16:34 UTC · model grok-4.3
The pith
Learning progression-based rubrics produce AI feedback for science explanations as effectively as expert-designed rubrics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An LP-driven rubric generation pipeline produced AI feedback on students' scientific explanations that was statistically indistinguishable from feedback guided by a human expert-designed task-specific rubric, as measured across multiple quality dimensions by human coders.
What carries the argument
The learning progression-driven rubric pipeline, which automatically derives task-specific rubrics from a learning progression prior to generating feedback.
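The paper describes the derivation step only at a high level, so the following is a hypothetical sketch rather than the authors' method. It assumes each learning progression level maps to one rubric criterion, and that the prompt asks the model to (a) score the response against the rubric and (b) write feedback aligned with that score. The names `LPLevel`, `derive_rubric`, and `build_prompt`, and the level descriptions, are invented placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LPLevel:
    level: int          # position on the learning progression (assumed ordinal)
    description: str    # what a response at this level looks like


def derive_rubric(levels):
    """Turn each LP level into one scoring criterion (hypothetical mapping)."""
    ordered = sorted(levels, key=lambda lv: lv.level)
    return [
        {"score": lv.level, "criterion": f"Response shows: {lv.description}"}
        for lv in ordered
    ]


def build_prompt(rubric, task, student_response):
    """Assemble a grading-and-feedback prompt; calling a model is left out."""
    lines = [f"Task: {task}", "Rubric:"]
    lines += [f"  {row['score']}: {row['criterion']}" for row in rubric]
    lines += [
        f"Student response: {student_response}",
        "First assign a rubric score, then write formative feedback "
        "aligned with that score.",
    ]
    return "\n".join(lines)


# Placeholder levels, not the paper's learning progression
levels = [
    LPLevel(2, "links evidence to the claim with a scientific principle"),
    LPLevel(0, "states a claim with no evidence"),
    LPLevel(1, "cites evidence but gives no reasoning"),
]
rubric = derive_rubric(levels)
prompt = build_prompt(
    rubric,
    "Explain whether a chemical reaction occurred.",
    "The gas proves it reacted.",
)
print(prompt)
```

Keeping rubric derivation separate from prompt assembly means the expert-authored rubric could be passed to the same `build_prompt` function, so the two pipelines would differ only in where the rubric comes from.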
If this is right
- The LP approach can serve as a scalable alternative to expert rubric authoring for AI feedback systems.
- Feedback quality remains comparable in clarity, relevance, engagement, and reflectiveness.
- This method may reduce the time and expertise required to create feedback rubrics for new instructional contexts.
- Human coders rated the feedback with high inter-rater reliability (89% to 100% agreement; kappa = .66 to .88 on estimable dimensions).
Where Pith is reading between the lines
- This could enable faster adaptation of AI tutoring systems across different science topics without repeated expert input.
- Future work might test whether the same LP-derived rubrics improve actual student learning outcomes, not just feedback quality ratings.
- The approach may generalize to other subjects if learning progressions are available.
Load-bearing premise
The multi-dimensional rubric used to evaluate the feedback fully captures what makes feedback educationally valuable, and the learning progression accurately maps student understanding in the chemistry task.
What would settle it
A study where students receiving LP-guided feedback show significantly different learning gains compared to those receiving expert-rubric feedback, as measured by pre-post tests on the chemistry concepts.
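A minimal sketch of how such a comparison could be set up, assuming two independent groups of students and made-up pre/post scores. The function names `gains` and `welch_t` are illustrative and not from the paper. Welch's t is one reasonable choice for independent groups; the paper itself reports only paired t-tests on feedback ratings.

```python
import math
from statistics import mean, variance


def gains(pre, post):
    """Per-student gain: post-test score minus pre-test score."""
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same length")
    return [after - before for before, after in zip(pre, post)]


def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = variance(group_a) / len(group_a)
    var_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(group_a) - 1) + var_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical pre/post scores, for illustration only
lp_gains = gains(pre=[3, 4, 2, 5], post=[6, 6, 5, 7])
expert_gains = gains(pre=[3, 5, 2, 4], post=[5, 8, 4, 7])
t_stat, df = welch_t(lp_gains, expert_gains)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

A p-value would additionally require the t distribution, which the Python standard library does not provide; a statistics package would be needed for that step.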
Original abstract
Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts. While effective, rubric authoring is time-consuming and limits scalability across instructional contexts. Learning progressions (LP) provide a theoretically grounded representation of students' developing understanding and may offer an alternative solution. This study examines whether an LP-driven rubric generation pipeline can produce AI-generated feedback comparable in quality to feedback guided by expert-authored task rubrics. We analyzed AI-generated feedback for written scientific explanations produced by 207 middle school students in a chemistry task. Two pipelines were compared: (a) feedback guided by a human expert-designed, task-specific rubric, and (b) feedback guided by a task-specific rubric automatically derived from a learning progression prior to grading and feedback generation. Two human coders evaluated feedback quality using a multi-dimensional rubric assessing Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness (10 sub-dimensions). Inter-rater reliability was high, with percent agreement ranging from 89% to 100% and Cohen's kappa values for estimable dimensions (kappa = .66 to .88). Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity (t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = .399), Relevance (t1 = 0.28, p1 = .782; t2 = -0.58, p2 = .565), Engagement and Motivation (t1 = 0.50, p1 = .618; t2 = -0.58, p2 = .565), or Reflectiveness (t = -0.45, p = .656). These findings suggest that the LP-driven rubric pipeline can serve as an alternative solution.
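The two reliability statistics named in the abstract can be sketched as follows, using made-up codes on a 0 to 2 scale rather than the study's data. The function names are illustrative. Kappa is undefined when both coders assign a single code to every item, which is one plausible reason the abstract restricts kappa to "estimable dimensions"; the sketch returns `None` in that case.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of items on which the two raters gave the same code."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa; None when chance agreement is 1 (not estimable)."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement from each rater's marginal code frequencies
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_expected == 1.0:
        return None
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up codes, for illustration only
coder_1 = [2, 2, 1, 2, 0, 2, 1, 2, 2, 1]
coder_2 = [2, 2, 1, 2, 1, 2, 1, 2, 2, 2]
print(percent_agreement(coder_1, coder_2))
print(cohens_kappa(coder_1, coder_2))
print(cohens_kappa([2] * 5, [2] * 5))
```

Because kappa corrects for chance agreement, it can be noticeably lower than percent agreement when most items receive the same code, so the two figures are best read together.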
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that an LP-driven rubric generation pipeline produces AI feedback for middle-school chemistry explanations comparable in quality to expert-authored rubric feedback, as evidenced by no statistically significant differences in paired t-tests on human-coded dimensions (Clarity, Relevance, Engagement/Motivation, Reflectiveness) for 207 student responses, with high inter-rater reliability.
Significance. If the central comparison holds after addressing gaps in validation and outcome measures, the work would demonstrate a scalable, theory-grounded alternative to expert rubric authoring for AI formative feedback in science education, potentially reducing development costs while preserving quality.
major comments (2)
- [Abstract] Abstract: the claim that the LP pipeline 'can serve as an alternative solution' rests on coder ratings alone; no pre/post learning gains, explanation improvement scores, or student outcome data are reported for the 207 explanations, leaving the educational effectiveness of the equivalence untested.
- [Abstract] Abstract and methods description: the automatic derivation of the task-specific rubric from the LP is described at high level only, with no reported validation that the resulting dimensions align with actual student misconception trajectories or empirical data in this chemistry domain.
minor comments (1)
- [Abstract] Abstract: provide the exact sample selection criteria and full statistical reporting (e.g., effect sizes, confidence intervals) beyond the listed t and p values to allow independent verification.
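The fuller reporting requested here could look like the sketch below, which computes a paired t statistic, Cohen's d_z (mean difference divided by the standard deviation of the differences), and a confidence interval for the mean difference. The data are invented, and `paired_summary` is a hypothetical helper. As a stated simplification, the interval uses a normal critical value as a large-sample stand-in for the t quantile.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_summary(scores_a, scores_b, confidence=0.95):
    """Paired t statistic, Cohen's d_z, and an approximate CI for the mean difference."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)
    if sd_diff == 0:
        # No variation in the differences: t and d_z are undefined
        return {"n": n, "df": n - 1, "mean_diff": mean_diff,
                "t": None, "d_z": None, "ci": None}
    se = sd_diff / math.sqrt(n)
    # Normal critical value (assumption: n is large enough for this to be close)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "n": n,
        "df": n - 1,
        "mean_diff": mean_diff,
        "t": mean_diff / se,
        "d_z": mean_diff / sd_diff,
        "ci": (mean_diff - z * se, mean_diff + z * se),
    }


# Invented ratings for the same responses under the two pipelines
expert_rubric = [2, 2, 2, 2]
lp_rubric = [1, 2, 1, 2]
print(paired_summary(expert_rubric, lp_rubric))
```

When every difference is identical the test statistic is undefined, which is why the helper returns `None` values instead of dividing by zero.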
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below, clarifying the study's scope while planning targeted revisions to improve transparency and detail.
point-by-point responses
- Referee: [Abstract] Abstract: the claim that the LP pipeline 'can serve as an alternative solution' rests on coder ratings alone; no pre/post learning gains, explanation improvement scores, or student outcome data are reported for the 207 explanations, leaving the educational effectiveness of the equivalence untested.
  Authors: We agree that the study evaluates feedback quality via human coder ratings rather than direct student learning outcomes. The primary objective was to test equivalence in feedback quality between the two pipelines as a necessary first step before outcome studies. No pre/post gains or explanation improvement scores were collected. In revision we will update the abstract and add an explicit limitations paragraph in the discussion that states this scope limitation and outlines planned future work measuring educational effectiveness through student outcome data.
  Revision: partial
- Referee: [Abstract] Abstract and methods description: the automatic derivation of the task-specific rubric from the LP is described at high level only, with no reported validation that the resulting dimensions align with actual student misconception trajectories or empirical data in this chemistry domain.
  Authors: The methods section describes the pipeline at a summary level. We will expand it with a step-by-step account of how LP levels are mapped to rubric dimensions and criteria. The underlying LP draws on prior empirical work in chemistry, yet the current study does not include new domain-specific validation of the derived dimensions against misconception trajectories. We will add a brief discussion of this point and note it as an area for future empirical validation.
  Revision: partial
Circularity Check
No circularity: empirical comparison with external human ratings
full rationale
The paper reports an empirical study that generates AI feedback via two pipelines (expert rubric vs. LP-derived rubric), collects human coder ratings on multiple dimensions, computes inter-rater reliability, and performs paired t-tests showing no significant differences. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the derivation chain. The central claim rests on external human evaluation data and standard statistical tests rather than any reduction of outputs to inputs by construction. The result is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Learning progressions provide a valid, task-specific representation of developing student understanding that can be converted into rubrics.
- domain assumption: The multi-dimensional human evaluation rubric measures the educational effectiveness of feedback.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.RealityFromDistinction · reality_from_one_distinction
  Tag: unclear
  Relation between the paper passage and the cited Recognition theorem:
  "Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity... These findings suggest that the LP-driven rubric pipeline can serve as an alternative solution."
- IndisputableMonolith.Cost · Jcost
  Tag: unclear
  Relation between the paper passage and the cited Recognition theorem:
  "Learning progressions (LP) provide a theoretically grounded representation of students’ developing understanding"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Using Learning Progressions to Guide AI Feedback for Science Learning Xin Xia1[0009-0009-1717-8511], Nejla Yuruk 2[0000-0001-9240-750X], Yun Wang1[0009-0004-6611-0752] and Xiaoming Zhai1[0000-0003-4519-1931] 1 University of Georgia, Athens GA 30605, USA 2 Gazi University, Ankara, 06560, Turkiye xx86245@uga.com Abstract. Generative artificial intelligence ...
-
[2]
describes an LLM-assisted feedback tool in which instructor-defined criteria are used to guide automated responses to open-ended questions, emphasizing that criteria-based scaffolding supports more targeted and useful feedback. At the same time, empirical comparisons caution that generative AI feedback may fall short of human feedback on criteria-based q...
-
[3]
Learning Progression for Evidence-based Explanations
3.3 AI Feedback Generation
Feedback in both pipelines was generated using the same large language model (GPT-5.1). For each student response, the model was prompted to (a) evaluate the response according to the provided rubric and (b) generate formative feedback aligned with the evaluation. Feedback was...
-
[4]
Feedback Evaluation T-test.
Sub-Dimension | Rubric M (SD) | LP M (SD) | t | df | p
Clarity – Language | 2.00 (0.07) | 2.00 (0.07) | 0.00 | 206 | 1.000
Clarity – Structure | 1.87 (0.33) | 1.85 (0.36) | 0.84 | 206 | 0.399
Accuracy – Correctness | 2.00 (0.00) | 2.00 (0.00) | — | 206 | —
Accuracy – Terminology | 2.00 (0.00) | 2.00 (0.00) | — | 206 | —
Relevance – Responsiveness | 1.97 (0.18) | 1.96 (0.19) | 0.28 | 2...
-
[5]
McNeill, K.L., Krajcik, J.S.: Supporting Grade 5-8 Students in Constructing Explanations in Science: The Claim, Evidence, and Reasoning Framework for Talk and Writing. Pearson (2011)
-
[6]
Shute, V.J.: Focus on Formative Feedback. Rev. Educ. Res. 78, 153–189 (2008). https://doi.org/10.3102/0034654307313795
-
[7]
Hattie, J., Timperley, H.: The Power of Feedback. Rev. Educ. Res. 77, 81–112 (2007). https://doi.org/10.3102/003465430298487
-
[8]
Abar, R.O., Pong, M., Som, R.: AI-Driven Feedback Systems for Formative Assessment: Toward Personalized and Real-Time Pedagogy. Al-Hijr J. Adulearn World. 4, 87–100 (2025). https://doi.org/10.55849/alhijr.v4i2.984
-
[9]
Deepshikha, D.: A systematic review on the future of educational assessment: AI-driven grading and personalised feedback in higher education. Artif. Intell. Educ. 1–41 (2025). https://doi.org/10.1108/AIIE-03-2025-0036
-
[10]
Bulut, O., Wongvorachan, T.: Feedback Generation through Artificial Intelligence. Open-Technology Educ. Soc. Scholarsh. Assoc. Conf. 2, 1–9 (2022). https://doi.org/10.18357/otessac.2022.2.1.125
-
[11]
VanLehn, K.: The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems. Educ. Psychol. 46, 197–221 (2011). https://doi.org/10.1080/00461520.2011.611369
-
[12]
Lai, P., Lau, I., Pang, R.: Exploring the Efficacy of Rubric-Based AI Feedback in Enhancing Student Writing Outcomes. In: 2024 6th International Workshop on Artificial Intelligence and Education (WAIE). pp. 220–224 (2024). https://doi.org/10.1109/WAIE63876.2024.00047
-
[13]
Pan, Y.: Leveraging generative AI powered rubric-indexed feedback as a formative assessment strategy for enhancing medical English education. Discov. Comput. 28, 284 (2025). https://doi.org/10.1007/s10791-025-09830-9
-
[14]
Brookhart, S.M.: Appropriate Criteria: Key to Effective Rubrics. Front. Educ. 3, (2018). https://doi.org/10.3389/feduc.2018.00022
-
[15]
Panadero, E., Andrade, H., Brookhart, S.: Fusing self-regulated learning and formative assessment: a roadmap of where we are, how we got here, and where we are going. Aust. Educ. Res. 45, 13–31 (2018). https://doi.org/10.1007/s13384-018-0258-y
-
[16]
Alonzo, A.C., Steedle, J.T.: Developing and assessing a force and motion learning progression. Sci. Educ. 93, 389–421 (2009). https://doi.org/10.1002/sce.20303
-
[17]
Duncan, R.G., Hmelo-Silver, C.E.: Learning progressions: Aligning curriculum, instruction, and assessment. J. Res. Sci. Teach. 46, 606–609 (2009). https://doi.org/10.1002/tea.20316
-
[18]
Bell, B., Cowie, B.: The characteristics of formative assessment in science education. Sci. Educ. 85, 536–553 (2001). https://doi.org/10.1002/sce.1022
-
[19]
Black, P.: Assessment and feedback in science education. Stud. Educ. Eval. 21, 257–279 (1995). https://doi.org/10.1016/0191-491X(95)00015-M
-
[20]
Koedinger, K.R., D’Mello, S., McLaughlin, E.A., Pardos, Z.A., Rosé, C.P.: Data mining and education. WIREs Cogn. Sci. 6, 333–353 (2015). https://doi.org/10.1002/wcs.1350
-
[21]
Lee, H.-S., Pallant, A., Pryputniewicz, S., Lord, T., Mulholland, M., Liu, O.L.: Automated text scoring and real-time adjustable feedback: Supporting revision of scientific arguments involving uncertainty. Sci. Educ. 103, 590–622 (2019). https://doi.org/10.1002/sce.21504
-
[22]
Jescovitch, L.N., Scott, E.E., Cerchiara, J.A., Merrill, J., Urban-Lurain, M., Doherty, J.H., Haudek, K.C.: Comparison of Machine Learning Performance Using Analytic and Holistic Coding Approaches Across Constructed Response Assessments Aligned to a Science Learning Progression. J. Sci. Educ. Technol. 30, 150–167 (2021). https://doi.org/10.1007/s10956-02...
-
[23]
Kaldaras, L., Haudek, K., Krajcik, J.: Employing automatic analysis tools aligned to learning progressions to assess knowledge application and support learning in STEM. Int. J. STEM Educ. 11, 57 (2024). https://doi.org/10.1186/s40594-024-00516-0
-
[24]
Jukiewicz, M., Wyrwa, M.: Can ChatGPT Replace the Teacher in Assessment? A Review of Research on the Use of Large Language Models in Grading and Providing Feedback. Appl. Sci. 16, (2026). https://doi.org/10.3390/app16020680
-
[25]
Steiss, J., Tate, T., Graham, S., Cruz, J., Hebert, M., Wang, J., Moon, Y., Tseng, W., Warschauer, M., Olson, C.B.: Comparing the quality of human and ChatGPT feedback of students’ writing. Learn. Instr. 91, 101894 (2024). https://doi.org/10.1016/j.learninstruc.2024.101894
-
[26]
Liu, O.L., Brew, C., Blackmore, J., Gerard, L., Madhok, J., Linn, M.C.: Automated Scoring of Constructed-Response Science Items: Prospects and Obstacles. Educ. Meas. Issues Pract. 33, 19–28 (2014). https://doi.org/10.1111/emip.12028
-
[27]
Matelsky, J.K., Parodi, F., Liu, T., Lange, R.D., Kording, K.P.: A large language model-assisted education tool to provide feedback on open-ended responses, http://arxiv.org/abs/2308.02439, (2023). https://doi.org/10.48550/arXiv.2308.02439
-
[28]
National Research Council: A Framework for K-12 Science Education: Practices, Crosscutting Concepts, and Core Ideas. National Academies Press (2012)
-
[29]
Wilson, M.: Measuring progressions: Assessment structures underlying a learning progression. J. Res. Sci. Teach. 46, 716–730 (2009). https://doi.org/10.1002/tea.20318
-
[30]
Jin, H., Lima, C., Wang, L.: Automated Scoring in Learning Progression-Based Assessment: A Comparison of Researcher and Machine Interpretations. Educ. Meas. Issues Pract. 44, 25–37 (2025). https://doi.org/10.1111/emip.70003
-
[31]
Kaldaras, L., Haudek, K.C.: Validation of automated scoring for learning progression-aligned Next Generation Science Standards performance assessments. Front. Educ. 7, (2022). https://doi.org/10.3389/feduc.2022.968289
-
[32]
Harris, C.J., Krajcik, J.S., Pellegrino, J.W.: Creating and using instructionally supportive assessments in NGSS classrooms. NSTA Press, National Science Teaching Association (2024)
-
[33]
Zhai, X., He, P., Krajcik, J.: Applying machine learning to automatically assess scientific models. J. Res. Sci. Teach. 59, 1765–1794 (2022). https://doi.org/10.1002/tea.21773
-
[34]
Gotwals, A.W., Songer, N.B.: Validity Evidence for Learning Progression-Based Assessment Items That Fuse Core Disciplinary Ideas and Science Practices. J. Res. Sci. Teach. 50, 597–626 (2013). https://doi.org/10.1002/tea.21083
-
[35]
He, P., Shin, N., Zhai, X., Krajcik, J.: A Design Framework for Integrating Artificial Intelligence to Support Teachers’ Timely Use of Knowledge-in-Use Assessments. (2023). https://doi.org/10.13140/RG.2.2.19088.58881
-
[36]
Watts, F.M., Liu, L., Ober, T.M., Song, Y., Valle, E.J.-D., Zhai, X., Wang, Y., Liu, N.: A Framework for Designing an AI Chatbot to Support Scientific Argumentation. Educ. Sci. 15, (2025). https://doi.org/10.3390/educsci15111507
-
[37]
Cohen, J.: A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 20, 37–46 (1960). https://doi.org/10.1177/001316446002000104
-
[38]
McHugh, M.L.: Interrater reliability: the kappa statistic. Biochem. Medica. 22, 276–282 (2012)
-
[39]
Li, M., Gao, Q., Yu, T.: Kappa statistic considerations in evaluating inter-rater reliability between two raters: which, when and context matters. BMC Cancer. 23, 799 (2023). https://doi.org/10.1186/s12885-023-11325-z
-
[40]
Zhu, M., Liu, O.L., Lee, H.-S.: The effect of automated feedback on revision behavior and learning gains in formative assessment of scientific argument writing. Comput. Educ. 143, 103668 (2020). https://doi.org/10.1016/j.compedu.2019.103668
-
[41]
Nazaretsky, T., Gabbay, H., Käser, T.: Can students judge like experts? A large-scale study on the pedagogical quality of AI and human personalized formative feedback. Comput. Educ. Artif. Intell. 10, 100533 (2026). https://doi.org/10.1016/j.caeai.2025.100533
-
[42]
Zhang, D.-W., Boey, M., Tan, Y.Y., Jia, A.H.S.: Evaluating large language models for criterion-based grading from agreement to consistency. NPJ Sci. Learn. 9, 79 (2024). https://doi.org/10.1038/s41539-024-00291-1