Using Learning Progressions to Guide AI Feedback for Science Learning

Xin Xia (1), Nejla Yuruk (2), Yun Wang (1), Xiaoming Zhai (1) ((1) University of Georgia, (2) Gazi University)

arxiv: 2603.03249 · v2 · submitted 2026-03-03 · 💻 cs.CL

Pith reviewed 2026-05-15 16:34 UTC · model grok-4.3

classification 💻 cs.CL

keywords learning progressions · AI feedback · science education · formative assessment · rubric generation · generative AI · middle school · chemistry explanations

The pith

Learning progression-based rubrics produce AI feedback for science explanations as effectively as expert-designed rubrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether learning progressions can replace expert rubrics in guiding AI to give feedback on middle school students' chemistry explanations. Researchers generated feedback two ways: one using a traditional expert rubric, the other using a rubric automatically derived from a learning progression. Two human coders scored both sets of feedback on clarity, accuracy, relevance, engagement, and reflectiveness, and paired t-tests found no statistically significant differences between the pipelines on the dimensions tested. If this holds, it suggests a way to scale high-quality AI feedback without needing domain experts to write new rubrics for every task.

Core claim

An LP-driven rubric generation pipeline produced AI feedback on students' scientific explanations that was statistically indistinguishable from feedback guided by a human expert-designed task-specific rubric, as measured across multiple quality dimensions by human coders.

What carries the argument

The learning progression-driven rubric pipeline, which automatically derives task-specific rubrics from a learning progression prior to generating feedback.

If this is right

The LP approach can serve as a scalable alternative to expert rubric authoring for AI feedback systems.
Feedback quality remains comparable in clarity, relevance, engagement, and reflectiveness.
This method may reduce the time and expertise required to create feedback rubrics for new instructional contexts.
Human coders can rate feedback from both pipelines with high inter-rater reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could enable faster adaptation of AI tutoring systems across different science topics without repeated expert input.
Future work might test whether the same LP-derived rubrics improve actual student learning outcomes, not just feedback quality ratings.
The approach may generalize to other subjects if learning progressions are available.

Load-bearing premise

The multi-dimensional rubric used to evaluate the feedback fully captures what makes feedback educationally valuable, and the learning progression accurately maps student understanding in the chemistry task.

What would settle it

A study where students receiving LP-guided feedback show significantly different learning gains compared to those receiving expert-rubric feedback, as measured by pre-post tests on the chemistry concepts.

Figures

Figures reproduced from arXiv: 2603.03249 by Xin Xia (1), Nejla Yuruk (2), Yun Wang (1), Xiaoming Zhai (1) ((1) University of Georgia, (2) Gazi University).

**Figure 2.** Learning Progression for Evidence-based Explanations

Original abstract

Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts. While effective, rubric authoring is time-consuming and limits scalability across instructional contexts. Learning progressions (LP) provide a theoretically grounded representation of students' developing understanding and may offer an alternative solution. This study examines whether an LP-driven rubric generation pipeline can produce AI-generated feedback comparable in quality to feedback guided by expert-authored task rubrics. We analyzed AI-generated feedback for written scientific explanations produced by 207 middle school students in a chemistry task. Two pipelines were compared: (a) feedback guided by a human expert-designed, task-specific rubric, and (b) feedback guided by a task-specific rubric automatically derived from a learning progression prior to grading and feedback generation. Two human coders evaluated feedback quality using a multi-dimensional rubric assessing Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness (10 sub-dimensions). Inter-rater reliability was high, with percent agreement ranging from 89% to 100% and Cohen's kappa values for estimable dimensions (kappa = .66 to .88). Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity (t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = .399), Relevance (t1 = 0.28, p1 = .782; t2 = -0.58, p2 = .565), Engagement and Motivation (t1 = 0.50, p1 = .618; t2 = -0.58, p2 = .565), or Reflectiveness (t = -0.45, p = .656). These findings suggest that the LP-driven rubric pipeline can serve as an alternative solution.
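The reliability figures in the abstract rest on two statistics: percent agreement and Cohen's kappa. The sketch below is a minimal illustration of both in plain Python, not the authors' analysis code. The ratings, the 0 to 2 scale, and the function names are hypothetical. It also shows one likely reason kappa would be reported only for "estimable" dimensions: when both coders give every item the same score, chance agreement equals 1 and kappa is undefined.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of items on which the two raters gave the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters; returns None when it is not estimable."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    categories = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        # Both raters used a single identical category: kappa is 0/0.
        return None
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical ratings on a 0-2 scale (illustrative only, not study data).
coder_a = [2, 2, 1, 2, 0, 1, 2, 2, 1, 2]
coder_b = [2, 2, 1, 2, 1, 1, 2, 2, 2, 2]

print("agreement:", percent_agreement(coder_a, coder_b))
print("kappa:", cohens_kappa(coder_a, coder_b))
print("kappa with constant ratings:", cohens_kappa([2] * 5, [2] * 5))
```

High percent agreement with a modest kappa is typical when nearly all ratings fall in one category, which is why the two statistics are usually read together.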

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LP-derived rubrics match expert ones on human ratings of AI feedback but the study gives no data on whether either improves student learning.

The paper finds that AI feedback guided by a rubric derived from a learning progression gets similar human ratings to feedback from an expert task rubric, at least for clarity, relevance, engagement, and reflectiveness in this chemistry task. That's the core result from the 207 student explanations.

What is new is the head-to-head test of these two pipelines with the same AI model and the same set of student responses. The methods are straightforward: derive the rubric from the LP, generate feedback, then have coders score it on multiple dimensions with good reliability. The t-tests show no significant differences, which supports the idea that LP can help scale rubric creation. The work is solid on the rating comparison and the practical angle for reducing expert effort. It gives a concrete example in middle school science.

The soft spots come from what is missing. There are no measures of whether students actually learn more or improve their explanations from the feedback. The ratings are surface-level judgments by coders, not tied to learning outcomes. Without that, it's hard to say the approaches are equivalent in value. The LP-to-rubric step is also only sketched at a high level, with no checks on how well it matches student thinking patterns here. The null results are interesting but don't prove practical sameness, especially without effect sizes or power details.

This is for people building AI feedback systems in education or testing learning progressions in real tasks. A reader in that area gets a useful comparison to think about. It deserves a serious referee because the design is replicable and the question is relevant, even if revisions should strengthen the outcome side. I would send it for review.
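The letter's point that a non-significant t-test does not establish practical sameness can be made concrete with an equivalence test. The sketch below uses two one-sided tests (TOST), a technique the paper is not reported to use, with a normal approximation to the t distribution so it runs on the standard library alone. The scores, the equivalence margin, and the function name `tost_paired` are illustrative assumptions.

```python
import math
from statistics import NormalDist, mean, stdev


def tost_paired(x, y, margin):
    """Approximate TOST p-value for paired scores.

    Tests H0: |mean(x - y)| >= margin against H1: |mean(x - y)| < margin.
    Uses a normal approximation, which is reasonable for large samples.
    Returns None when the paired differences have zero variance.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    if margin <= 0:
        raise ValueError("margin must be positive")
    diffs = [a - b for a, b in zip(x, y)]
    sd = stdev(diffs)
    if sd == 0:
        return None
    se = sd / math.sqrt(len(diffs))
    m = mean(diffs)
    std_normal = NormalDist()
    # One-sided test against the lower bound (-margin).
    p_lower = 1 - std_normal.cdf((m + margin) / se)
    # One-sided test against the upper bound (+margin).
    p_upper = std_normal.cdf((m - margin) / se)
    # Equivalence requires rejecting both, so report the larger p-value.
    return max(p_lower, p_upper)


# Hypothetical paired ratings (illustrative only, not study data).
expert_scores = [2, 2, 1, 2] * 25
lp_scores = [2, 1, 2, 2] * 25

print("p (margin 0.30):", tost_paired(expert_scores, lp_scores, 0.30))
print("p (margin 0.05):", tost_paired(expert_scores, lp_scores, 0.05))
```

The same data can support equivalence under a wide margin and fail to support it under a narrow one, so the margin has to be justified on substantive grounds before the data are examined.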

Referee Report

2 major / 1 minor

Summary. The paper claims that an LP-driven rubric generation pipeline produces AI feedback for middle-school chemistry explanations comparable in quality to expert-authored rubric feedback, as evidenced by no statistically significant differences in paired t-tests on human-coded dimensions (Clarity, Relevance, Engagement/Motivation, Reflectiveness) for 207 student responses, with high inter-rater reliability.

Significance. If the central comparison holds after addressing gaps in validation and outcome measures, the work would demonstrate a scalable, theory-grounded alternative to expert rubric authoring for AI formative feedback in science education, potentially reducing development costs while preserving quality.

major comments (2)

[Abstract] Abstract: the claim that the LP pipeline 'can serve as an alternative solution' rests on coder ratings alone; no pre/post learning gains, explanation improvement scores, or student outcome data are reported for the 207 explanations, leaving the educational effectiveness of the equivalence untested.
[Abstract] Abstract and methods description: the automatic derivation of the task-specific rubric from the LP is described at high level only, with no reported validation that the resulting dimensions align with actual student misconception trajectories or empirical data in this chemistry domain.

minor comments (1)

[Abstract] Abstract: provide the exact sample selection criteria and full statistical reporting (e.g., effect sizes, confidence intervals) beyond the listed t and p values to allow independent verification.
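As a rough illustration of the additional reporting requested here, the sketch below computes a paired t statistic, the standardized mean difference d_z, and an approximate confidence interval for the mean difference. It is a sketch under stated assumptions, not the authors' analysis: the scores and the function name `paired_summary` are hypothetical, and the interval uses a normal critical value, which is close to the t critical value only for large samples such as df = 206. When every paired difference is identical, the t statistic is undefined, which would be consistent with the dashes shown for the Accuracy rows in the extracted results table.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_summary(x, y, confidence=0.95):
    """Paired t statistic, effect size d_z, and an approximate CI."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)
    result = {"n": n, "df": n - 1, "mean_diff": mean_diff,
              "t": None, "d_z": None, "ci": None}
    if sd_diff == 0:
        # No variance in the differences: t and d_z are undefined.
        return result
    se = sd_diff / math.sqrt(n)
    # Normal critical value as a large-sample stand-in for the t quantile.
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    result["t"] = mean_diff / se
    result["d_z"] = mean_diff / sd_diff
    result["ci"] = (mean_diff - z_crit * se, mean_diff + z_crit * se)
    return result


# Hypothetical paired ratings (illustrative only, not study data).
rubric_scores = [2, 2, 1, 2, 2, 1, 2, 2]
lp_scores = [2, 1, 1, 2, 2, 2, 2, 1]

print(paired_summary(rubric_scores, lp_scores))
print(paired_summary([2, 2, 2], [2, 2, 2]))
```

Reporting d_z and the interval alongside t and p lets a reader judge how large a difference the data could still be hiding.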

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below, clarifying the study's scope while planning targeted revisions to improve transparency and detail.

Referee: [Abstract] Abstract: the claim that the LP pipeline 'can serve as an alternative solution' rests on coder ratings alone; no pre/post learning gains, explanation improvement scores, or student outcome data are reported for the 207 explanations, leaving the educational effectiveness of the equivalence untested.

Authors: We agree that the study evaluates feedback quality via human coder ratings rather than direct student learning outcomes. The primary objective was to test equivalence in feedback quality between the two pipelines as a necessary first step before outcome studies. No pre/post gains or explanation improvement scores were collected. In revision we will update the abstract and add an explicit limitations paragraph in the discussion that states this scope limitation and outlines planned future work measuring educational effectiveness through student outcome data.
revision: partial

Referee: [Abstract] Abstract and methods description: the automatic derivation of the task-specific rubric from the LP is described at high level only, with no reported validation that the resulting dimensions align with actual student misconception trajectories or empirical data in this chemistry domain.

Authors: The methods section describes the pipeline at a summary level. We will expand it with a step-by-step account of how LP levels are mapped to rubric dimensions and criteria. The underlying LP draws on prior empirical work in chemistry, yet the current study does not include new domain-specific validation of the derived dimensions against misconception trajectories. We will add a brief discussion of this point and note it as an area for future empirical validation.
revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparison with external human ratings

The paper reports an empirical study that generates AI feedback via two pipelines (expert rubric vs. LP-derived rubric), collects human coder ratings on multiple dimensions, computes inter-rater reliability, and performs paired t-tests showing no significant differences. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the derivation chain. The central claim rests on external human evaluation data and standard statistical tests rather than any reduction of outputs to inputs by construction. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim depends on the validity of the chosen learning progression as a representation of student understanding and on the assumption that the five evaluation dimensions adequately measure feedback quality. No free parameters or invented entities are introduced.

axioms (2)

domain assumption: Learning progressions provide a valid, task-specific representation of developing student understanding that can be converted into rubrics.
Invoked when deriving the automatic rubric from the LP prior to grading.
domain assumption: The multi-dimensional human evaluation rubric measures the educational effectiveness of feedback.
Used to conclude equivalence between pipelines.

pith-pipeline@v0.9.0 · 5667 in / 1275 out tokens · 33122 ms · 2026-05-15T16:34:35.908686+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.RealityFromDistinction · reality_from_one_distinction · unclear
Paper passage: "Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity... These findings suggest that the LP-driven rubric pipeline can serve as an alternative solution."

IndisputableMonolith.Cost · Jcost · unclear
Paper passage: "Learning progressions (LP) provide a theoretically grounded representation of students’ developing understanding"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

[1]

Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts

Using Learning Progressions to Guide AI Feedback for Science Learning Xin Xia1[0009-0009-1717-8511], Nejla Yuruk 2[0000-0001-9240-750X], Yun Wang1[0009-0004-6611-0752] and Xiaoming Zhai1[0000-0003-4519-1931] 1 University of Georgia, Athens GA 30605, USA 2 Gazi University, Ankara, 06560, Turkiye xx86245@uga.com Abstract. Generative artificial intelligence ...

work page 1931
[2]

describes an LLM-assisted feedback tool in which instructor-defined criteria are used to guide automated responses to open-ended questions, emphasizing that criteria-based scaffolding supports more targeted and useful feedback. At the same time, empirical comparisons caution that generative AI feedback may fall short of human feedback on criteria-based q...

work page 2013
[3]

For each student response, the model was prompted to (a) evaluate the response according to the provided rubric and (b) generate formative feedback aligned with the evaluation

Learning Progression for Evidence-based Explanations 3.3 AI Feedback Generation Feedback in both pipelines was generated using the same large language model (GPT-5.1). For each student response, the model was prompted to (a) evaluate the response according to the provided rubric and (b) generate formative feedback aligned with the evaluation. Feedback was...

work page 2025
[4]

Feedback Evaluation T-test.

Sub-Dimension | Rubric M (SD) | LP M (SD) | t | df | p
Clarity – Language | 2.00 (0.07) | 2.00 (0.07) | 0.00 | 206 | 1.000
Clarity – Structure | 1.87 (0.33) | 1.85 (0.36) | 0.84 | 206 | 0.399
Accuracy – Correctness | 2.00 (0.00) | 2.00 (0.00) | — | 206 | —
Accuracy – Terminology | 2.00 (0.00) | 2.00 (0.00) | — | 206 | —
Relevance – Responsiveness | 1.97 (0.18) | 1.96 (0.19) | 0.28 | 2...

work page 2020
[5]

Pearson (2011)

McNeill, K.L., Krajcik, J.S.: Supporting Grade 5-8 Students in Constructing Explanations in Science: The Claim, Evidence, and Reasoning Framework for Talk and Writing. Pearson (2011)

work page 2011
[6]

Shute, V.J.: Focus on Formative Feedback. Rev. Educ. Res. 78, 153–189 (2008). https://doi.org/10.3102/0034654307313795

work page doi:10.3102/0034654307313795 2008
[7]

Hattie, J., Timperley, H.: The Power of Feedback. Rev. Educ. Res. 77, 81–112 (2007). https://doi.org/10.3102/003465430298487

work page doi:10.3102/003465430298487 2007
[8]

Al-Hijr J

Abar, R.O., Pong, M., Som, R.: AI-Driven Feedback Systems for Formative Assessment: Toward Personalized and Real-Time Pedagogy. Al-Hijr J. Adulearn World. 4, 87–100 (2025). https://doi.org/10.55849/alhijr.v4i2.984

work page doi:10.55849/alhijr.v4i2.984 2025
[9]

Deepshikha, D.: A systematic review on the future of educational assessment: AI-driven grading and personalised feedback in higher education. Artif. Intell. Educ. 1–41 (2025). https://doi.org/10.1108/AIIE-03-2025-0036

work page doi:10.1108/aiie-03-2025-0036 2025
[10]

Open-Technology Educ

Bulut, O., Wongvorachan, T.: Feedback Generation through Artificial Intelligence. Open-Technology Educ. Soc. Scholarsh. Assoc. Conf. 2, 1–9 (2022). https://doi.org/10.18357/otessac.2022.2.1.125

work page doi:10.18357/otessac.2022.2.1.125 2022
[11]

VanLehn, K.: The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems. Educ. Psychol. 46, 197–221 (2011). https://doi.org/10.1080/00461520.2011.611369

work page doi:10.1080/00461520.2011.611369 2011
[12]

In: 2024 6th International Workshop on Artificial Intelligence and Education (WAIE)

Lai, P., Lau, I., Pang, R.: Exploring the Efficacy of Rubric-Based AI Feedback in Enhancing Student Writing Outcomes. In: 2024 6th International Workshop on Artificial Intelligence and Education (WAIE). pp. 220–224 (2024). https://doi.org/10.1109/WAIE63876.2024.00047

work page doi:10.1109/waie63876.2024.00047 2024
[13]

Pan, Y.: Leveraging generative AI powered rubric-indexed feedback as a formative assessment strategy for enhancing medical English education. Discov. Comput. 28, 284 (2025). https://doi.org/10.1007/s10791-025-09830-9

work page doi:10.1007/s10791-025-09830-9 2025
[14]

Brookhart, S.M.: Appropriate Criteria: Key to Effective Rubrics. Front. Educ. 3, (2018). https://doi.org/10.3389/feduc.2018.00022

work page doi:10.3389/feduc.2018.00022 2018
[15]

Panadero, E., Andrade, H., Brookhart, S.: Fusing self-regulated learning and formative assessment: a roadmap of where we are, how we got here, and where we are going. Aust. Educ. Res. 45, 13–31 (2018). https://doi.org/10.1007/s13384-018-0258-y

work page doi:10.1007/s13384-018-0258-y 2018
[16]

Alonzo, A.C., Steedle, J.T.: Developing and assessing a force and motion learning progression. Sci. Educ. 93, 389–421 (2009). https://doi.org/10.1002/sce.20303

work page doi:10.1002/sce.20303 2009
[17]

Duncan, R.G., Hmelo-Silver, C.E.: Learning progressions: Aligning curriculum, instruction, and assessment. J. Res. Sci. Teach. 46, 606–609 (2009). https://doi.org/10.1002/tea.20316

work page doi:10.1002/tea.20316 2009
[18]

Bell, B., Cowie, B.: The characteristics of formative assessment in science education. Sci. Educ. 85, 536–553 (2001). https://doi.org/10.1002/sce.1022

work page doi:10.1002/sce.1022 2001
[19]

Black, P.: Assessment and feedback in science education. Stud. Educ. Eval. 21, 257–279 (1995). https://doi.org/10.1016/0191-491X(95)00015-M

work page doi:10.1016/0191-491x(95)00015-m 1995
[20]

Koedinger, Sidney K

Koedinger, K.R., D’Mello, S., McLaughlin, E.A., Pardos, Z.A., Rosé, C.P.: Data mining and education. WIREs Cogn. Sci. 6, 333–353 (2015). https://doi.org/10.1002/wcs.1350

work page doi:10.1002/wcs.1350 2015
[21]

Lee, H.-S., Pallant, A., Pryputniewicz, S., Lord, T., Mulholland, M., Liu, O.L.: Automated text scoring and real-time adjustable feedback: Supporting revision of scientific arguments involving uncertainty. Sci. Educ. 103, 590–622 (2019). https://doi.org/10.1002/sce.21504

work page doi:10.1002/sce.21504 2019
[22]

Jescovitch, L.N., Scott, E.E., Cerchiara, J.A., Merrill, J., Urban-Lurain, M., Doherty, J.H., Haudek, K.C.: Comparison of Machine Learning Performance Using Analytic and Holistic Coding Approaches Across Constructed Response Assessments Aligned to a Science Learning Progression. J. Sci. Educ. Technol. 30, 150–167 (2021). https://doi.org/10.1007/s10956-02...

work page doi:10.1007/s10956-020-09858-0 2021
[23]

Kaldaras, L., Haudek, K., Krajcik, J.: Employing automatic analysis tools aligned to learning progressions to assess knowledge application and support learning in STEM. Int. J. STEM Educ. 11, 57 (2024). https://doi.org/10.1186/s40594-024-00516-0

work page doi:10.1186/s40594-024-00516-0 2024
[24]

Jukiewicz, M., Wyrwa, M.: Can ChatGPT Replace the Teacher in Assessment? A Review of Research on the Use of Large Language Models in Grading and Providing Feedback. Appl. Sci. 16, (2026). https://doi.org/10.3390/app16020680

work page doi:10.3390/app16020680 2026
[25]

Steiss, J., Tate, T., Graham, S., Cruz, J., Hebert, M., Wang, J., Moon, Y., Tseng, W., Warschauer, M., Olson, C.B.: Comparing the quality of human and ChatGPT feedback of students’ writing. Learn. Instr. 91, 101894 (2024). https://doi.org/10.1016/j.learninstruc.2024.101894

work page doi:10.1016/j.learninstruc.2024.101894 2024
[26]

Liu, O.L., Brew, C., Blackmore, J., Gerard, L., Madhok, J., Linn, M.C.: Automated Scoring of Constructed-Response Science Items: Prospects and Obstacles. Educ. Meas. Issues Pract. 33, 19–28 (2014). https://doi.org/10.1111/emip.12028

work page doi:10.1111/emip.12028 2014
[27]

https://doi.org/10.48550/arXiv.2308.02439

Matelsky, J.K., Parodi, F., Liu, T., Lange, R.D., Kording, K.P.: A large language model-assisted education tool to provide feedback on open-ended responses, http://arxiv.org/abs/2308.02439, (2023). https://doi.org/10.48550/arXiv.2308.02439

work page doi:10.48550/arxiv.2308.02439 2023
[28]

National Research Council: A Framework for K-12 Science Education: Practices, Crosscutting Concepts, and Core Ideas. National Academies Press (2012)

work page 2012
[29]

Wilson, M.: Measuring progressions: Assessment structures underlying a learning progression. J. Res. Sci. Teach. 46, 716–730 (2009). https://doi.org/10.1002/tea.20318

work page doi:10.1002/tea.20318 2009
[30]

Jin, H., Lima, C., Wang, L.: Automated Scoring in Learning Progression-Based Assessment: A Comparison of Researcher and Machine Interpretations. Educ. Meas. Issues Pract. 44, 25–37 (2025). https://doi.org/10.1111/emip.70003

work page doi:10.1111/emip.70003 2025
[31]

Kaldaras, L., Haudek, K.C.: Validation of automated scoring for learning progression-aligned Next Generation Science Standards performance assessments. Front. Educ. 7, (2022). https://doi.org/10.3389/feduc.2022.968289

work page doi:10.3389/feduc.2022.968289 2022
[32]

NSTA Press, National Science Teaching Association (2024)

Harris, C.J., Krajcik, J.S., Pellegrino, J.W.: Creating and using instructionally supportive assessments in NGSS classrooms. NSTA Press, National Science Teaching Association (2024)

work page 2024
[33]

Zhai, X., He, P., Krajcik, J.: Applying machine learning to automatically assess scientific models. J. Res. Sci. Teach. 59, 1765–1794 (2022). https://doi.org/10.1002/tea.21773

work page doi:10.1002/tea.21773 2022
[34]

Gotwals, A.W., Songer, N.B.: Validity Evidence for Learning Progression-Based Assessment Items That Fuse Core Disciplinary Ideas and Science Practices. J. Res. Sci. Teach. 50, 597–626 (2013). https://doi.org/10.1002/tea.21083

work page doi:10.1002/tea.21083 2013
[35]

He, P., Shin, N., Zhai, X., Krajcik, J.: A Design Framework for Integrating Artificial Intelligence to Support Teachers’ Timely Use of Knowledge-in-Use Assessments. (2023). https://doi.org/10.13140/RG.2.2.19088.58881

work page doi:10.13140/rg.2.2.19088.58881 2023
[36]

Watts, F.M., Liu, L., Ober, T.M., Song, Y., Valle, E.J.-D., Zhai, X., Wang, Y., Liu, N.: A Framework for Designing an AI Chatbot to Support Scientific Argumentation. Educ. Sci. 15, (2025). https://doi.org/10.3390/educsci15111507

work page doi:10.3390/educsci15111507 2025
[37]

Cohen, J.: A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 20, 37–46 (1960). https://doi.org/10.1177/001316446002000104

work page doi:10.1177/001316446002000104 1960
[38]

McHugh, M.L.: Interrater reliability: the kappa statistic. Biochem. Medica. 22, 276–282 (2012)

work page 2012
[39]

BMC Cancer

Li, M., Gao, Q., Yu, T.: Kappa statistic considerations in evaluating inter-rater reliability between two raters: which, when and context matters. BMC Cancer. 23, 799 (2023). https://doi.org/10.1186/s12885-023-11325-z

work page doi:10.1186/s12885-023-11325-z 2023
[40]

Zhu, M., Liu, O.L., Lee, H.-S.: The effect of automated feedback on revision behavior and learning gains in formative assessment of scientific argument writing. Comput. Educ. 143, 103668 (2020). https://doi.org/10.1016/j.compedu.2019.103668

work page doi:10.1016/j.compedu.2019.103668 2020
[41]

Nazaretsky, T., Gabbay, H., Käser, T.: Can students judge like experts? A large-scale study on the pedagogical quality of AI and human personalized formative feedback. Comput. Educ. Artif. Intell. 10, 100533 (2026). https://doi.org/10.1016/j.caeai.2025.100533

work page doi:10.1016/j.caeai.2025.100533 2026
[42]

Zhang, D.-W., Boey, M., Tan, Y.Y., Jia, A.H.S.: Evaluating large language models for criterion-based grading from agreement to consistency. NPJ Sci. Learn. 9, 79 (2024). https://doi.org/10.1038/s41539-024-00291-1

work page doi:10.1038/s41539-024-00291-1 2024

[1] [1]

Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific ru-brics authored by domain experts

Using Learning Progressions to Guide AI Feedback for Science Learning Xin Xia1[0009-0009-1717-8511], Nejla Yuruk 2[0000-0001-9240-750X], Yun Wang1[0009-0004-6611-0752] and Xiaoming Zhai1[0000-0003-4519-1931] 1 University of Georgia, Athens GA 30605, USA 2 Gazi University, Ankara, 06560, Turkiye xx86245@uga.com Abstract. Generative artificial intelligence ...

work page 1931

[2] [2]

describes an LLM-assisted feedback tool in which instructor-defined criteria are used to guide automated responses to open-ended questions, em-phasizing that criteria-based scaffolding supports more targeted and useful feedback. At the same time, empirical comparisons caution that generative AI feedback may fall short of human feedback on criteria-based q...

work page 2013

[3] [3]

For each student response, the model was prompted to (a) evaluate the response according to the provided rubric and (b) generate formative feedback aligned with the evaluation

Learning Progression for Evidence-based Explanations 3.3 AI Feedback Generation Feedback in both pipelines was generated using the same large language model (GPT-5.1). For each student response, the model was prompted to (a) evaluate the response according to the provided rubric and (b) generate formative feedback aligned with the evaluation. Feedback was...

work page 2025

[4] [4]

Feedback Evaluation T-test. Sub-Dimension Rubric M (SD) LP M (SD) t df p Clarity – Language 2.00 (0.07) 2.00 (0.07) 0.00 206 1.000 Clarity – Structure 1.87 (0.33) 1.85 (0.36) 0.84 206 0.399 Accuracy – Correctness 2.00 (0.00) 2.00 (0.00) — 206 — Accuracy – Terminology 2.00 (0.00) 2.00 (0.00) — 206 — Relevance – Responsiveness 1.97 (0.18) 1.96 (0.19) 0.28 2...

work page 2020

[5] [5]

Pear-son (2011)

McNeill, K.L., Krajcik, J.S.: Supporting Grade 5-8 Students in Constructing Explanations in Science: The Claim, Evidence, and Reasoning Framework for Talk and Writing. Pear-son (2011)

work page 2011

[6] [6]

Shute, V.J.: Focus on Formative Feedback. Rev. Educ. Res. 78, 153–189 (2008). https://doi.org/10.3102/0034654307313795

work page doi:10.3102/0034654307313795 2008

[7] [7]

Hattie, J., Timperley, H.: The Power of Feedback. Rev. Educ. Res. 77, 81–112 (2007). https://doi.org/10.3102/003465430298487

work page doi:10.3102/003465430298487 2007

[8] [8]

Al-Hijr J

Abar, R.O., Pong, M., Som, R.: AI-Driven Feedback Systems for Formative Assessment: Toward Personalized and Real-Time Pedagogy. Al-Hijr J. Adulearn World. 4, 87–100 (2025). https://doi.org/10.55849/alhijr.v4i2.984

work page doi:10.55849/alhijr.v4i2.984 2025

[9] [9]

Deepshikha, D.: A systematic review on the future of educational assessment: AI-driven grading and personalised feedback in higher education. Artif. Intell. Educ. 1–41 (2025). https://doi.org/10.1108/AIIE-03-2025-0036

work page doi:10.1108/aiie-03-2025-0036 2025

[10] [10]

Open-Technology Educ

Bulut, O., Wongvorachan, T.: Feedback Generation through Artificial Intelligence. Open-Technology Educ. Soc. Scholarsh. Assoc. Conf. 2, 1–9 (2022). https://doi.org/10.18357/otessac.2022.2.1.125

work page doi:10.18357/otessac.2022.2.1.125 2022

[11] [11]

VanLehn, K.: The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Sys-tems, and Other Tutoring Systems. Educ. Psychol. 46, 197–221 (2011). https://doi.org/10.1080/00461520.2011.611369

work page doi:10.1080/00461520.2011.611369 2011

[12] [12]

In: 2024 6th International Workshop on Artificial Intelli-gence and Education (WAIE)

Lai, P., Lau, I., Pang, R.: Exploring the Efficacy of Rubric-Based AI Feedback in Enhanc-ing Student Writing Outcomes. In: 2024 6th International Workshop on Artificial Intelli-gence and Education (WAIE). pp. 220–224 (2024). https://doi.org/10.1109/WAIE63876.2024.00047

work page doi:10.1109/waie63876.2024.00047 2024

[13] [13]

Pan, Y.: Leveraging generative AI powered rubric-indexed feedback as a formative as-sessment strategy for enhancing medical English education. Discov. Comput. 28, 284 (2025). https://doi.org/10.1007/s10791-025-09830-9

work page doi:10.1007/s10791-025-09830-9 2025

[14] [14]

Brookhart, S.M.: Appropriate Criteria: Key to Effective Rubrics. Front. Educ. 3, (2018). https://doi.org/10.3389/feduc.2018.00022

work page doi:10.3389/feduc.2018.00022 2018

[15] [15]

Panadero, E., Andrade, H., Brookhart, S.: Fusing self-regulated learning and formative as-sessment: a roadmap of where we are, how we got here, and where we are going. Aust. Educ. Res. 45, 13–31 (2018). https://doi.org/10.1007/s13384-018-0258-y

work page doi:10.1007/s13384-018-0258-y 2018

[16] [16]

Alonzo, A.C., Steedle, J.T.: Developing and assessing a force and motion learning pro-gression. Sci. Educ. 93, 389–421 (2009). https://doi.org/10.1002/sce.20303

work page doi:10.1002/sce.20303 2009

[17] [17]

Duncan, R.G., Hmelo-Silver, C.E.: Learning progressions: Aligning curriculum, instruc-tion, and assessment. J. Res. Sci. Teach. 46, 606–609 (2009). https://doi.org/10.1002/tea.20316

work page doi:10.1002/tea.20316 2009

[18] [18]

Bell, B., Cowie, B.: The characteristics of formative assessment in science education. Sci. Educ. 85, 536–553 (2001). https://doi.org/10.1002/sce.1022. 14 X.Xin1 et al

work page doi:10.1002/sce.1022 2001

[19] [19]

Black, P.: Assessment and feedback in science education. Stud. Educ. Eval. 21, 257–279 (1995). https://doi.org/10.1016/0191-491X(95)00015-M

work page doi:10.1016/0191-491x(95)00015-m 1995

[20] [20]

Koedinger, Sidney K

Koedinger, K.R., D’Mello, S., McLaughlin, E.A., Pardos, Z.A., Rosé, C.P.: Data mining and education. WIREs Cogn. Sci. 6, 333–353 (2015). https://doi.org/10.1002/wcs.1350

[21]

Lee, H.-S., Pallant, A., Pryputniewicz, S., Lord, T., Mulholland, M., Liu, O.L.: Automated text scoring and real-time adjustable feedback: Supporting revision of scientific arguments involving uncertainty. Sci. Educ. 103, 590–622 (2019). https://doi.org/10.1002/sce.21504

[22]

Jescovitch, L.N., Scott, E.E., Cerchiara, J.A., Merrill, J., Urban-Lurain, M., Doherty, J.H., Haudek, K.C.: Comparison of Machine Learning Performance Using Analytic and Holistic Coding Approaches Across Constructed Response Assessments Aligned to a Science Learning Progression. J. Sci. Educ. Technol. 30, 150–167 (2021). https://doi.org/10.1007/s10956-020-09858-0

[23]

Kaldaras, L., Haudek, K., Krajcik, J.: Employing automatic analysis tools aligned to learning progressions to assess knowledge application and support learning in STEM. Int. J. STEM Educ. 11, 57 (2024). https://doi.org/10.1186/s40594-024-00516-0

[24]

Jukiewicz, M., Wyrwa, M.: Can ChatGPT Replace the Teacher in Assessment? A Review of Research on the Use of Large Language Models in Grading and Providing Feedback. Appl. Sci. 16, (2026). https://doi.org/10.3390/app16020680

[25]

Steiss, J., Tate, T., Graham, S., Cruz, J., Hebert, M., Wang, J., Moon, Y., Tseng, W., Warschauer, M., Olson, C.B.: Comparing the quality of human and ChatGPT feedback of students’ writing. Learn. Instr. 91, 101894 (2024). https://doi.org/10.1016/j.learninstruc.2024.101894

[26]

Liu, O.L., Brew, C., Blackmore, J., Gerard, L., Madhok, J., Linn, M.C.: Automated Scoring of Constructed-Response Science Items: Prospects and Obstacles. Educ. Meas. Issues Pract. 33, 19–28 (2014). https://doi.org/10.1111/emip.12028

[27]

Matelsky, J.K., Parodi, F., Liu, T., Lange, R.D., Kording, K.P.: A large language model-assisted education tool to provide feedback on open-ended responses, http://arxiv.org/abs/2308.02439, (2023). https://doi.org/10.48550/arXiv.2308.02439

[28]

National Research Council: A Framework for K-12 Science Education: Practices, Crosscutting Concepts, and Core Ideas. National Academies Press (2012)

[29]

Wilson, M.: Measuring progressions: Assessment structures underlying a learning progression. J. Res. Sci. Teach. 46, 716–730 (2009). https://doi.org/10.1002/tea.20318

[30]

Jin, H., Lima, C., Wang, L.: Automated Scoring in Learning Progression-Based Assessment: A Comparison of Researcher and Machine Interpretations. Educ. Meas. Issues Pract. 44, 25–37 (2025). https://doi.org/10.1111/emip.70003

[31]

Kaldaras, L., Haudek, K.C.: Validation of automated scoring for learning progression-aligned Next Generation Science Standards performance assessments. Front. Educ. 7, (2022). https://doi.org/10.3389/feduc.2022.968289

[32]

Harris, C.J., Krajcik, J.S., Pellegrino, J.W.: Creating and using instructionally supportive assessments in NGSS classrooms. NSTA Press, National Science Teaching Association (2024)

[33]

Zhai, X., He, P., Krajcik, J.: Applying machine learning to automatically assess scientific models. J. Res. Sci. Teach. 59, 1765–1794 (2022). https://doi.org/10.1002/tea.21773

[34]

Gotwals, A.W., Songer, N.B.: Validity Evidence for Learning Progression-Based Assessment Items That Fuse Core Disciplinary Ideas and Science Practices. J. Res. Sci. Teach. 50, 597–626 (2013). https://doi.org/10.1002/tea.21083

[35]

He, P., Shin, N., Zhai, X., Krajcik, J.: A Design Framework for Integrating Artificial Intelligence to Support Teachers’ Timely Use of Knowledge-in-Use Assessments. (2023). https://doi.org/10.13140/RG.2.2.19088.58881

[36]

Watts, F.M., Liu, L., Ober, T.M., Song, Y., Valle, E.J.-D., Zhai, X., Wang, Y., Liu, N.: A Framework for Designing an AI Chatbot to Support Scientific Argumentation. Educ. Sci. 15, (2025). https://doi.org/10.3390/educsci15111507

[37]

Cohen, J.: A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 20, 37–46 (1960). https://doi.org/10.1177/001316446002000104

[38]

McHugh, M.L.: Interrater reliability: the kappa statistic. Biochem. Medica. 22, 276–282 (2012)

[39]

Li, M., Gao, Q., Yu, T.: Kappa statistic considerations in evaluating inter-rater reliability between two raters: which, when and context matters. BMC Cancer. 23, 799 (2023). https://doi.org/10.1186/s12885-023-11325-z

[40]

Zhu, M., Liu, O.L., Lee, H.-S.: The effect of automated feedback on revision behavior and learning gains in formative assessment of scientific argument writing. Comput. Educ. 143, 103668 (2020). https://doi.org/10.1016/j.compedu.2019.103668

[41]

Nazaretsky, T., Gabbay, H., Käser, T.: Can students judge like experts? A large-scale study on the pedagogical quality of AI and human personalized formative feedback. Comput. Educ. Artif. Intell. 10, 100533 (2026). https://doi.org/10.1016/j.caeai.2025.100533

[42]

Zhang, D.-W., Boey, M., Tan, Y.Y., Jia, A.H.S.: Evaluating large language models for criterion-based grading from agreement to consistency. NPJ Sci. Learn. 9, 79 (2024). https://doi.org/10.1038/s41539-024-00291-1
