arxiv: 2603.03249 · v2 · submitted 2026-03-03 · 💻 cs.CL

Using Learning Progressions to Guide AI Feedback for Science Learning

Pith reviewed 2026-05-15 16:34 UTC · model grok-4.3

classification 💻 cs.CL
keywords learning progressions · AI feedback · science education · formative assessment · rubric generation · generative AI · middle school · chemistry explanations

The pith

Learning progression-based rubrics produce AI feedback for science explanations as effectively as expert-designed rubrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether learning progressions can replace expert rubrics in guiding AI to give feedback on middle school students' chemistry explanations. Researchers generated feedback two ways: one using a traditional expert rubric, the other using a rubric automatically derived from a learning progression. Human evaluators scored both sets of feedback on clarity, accuracy, relevance, engagement, and reflectiveness, finding no meaningful differences between them. If this holds, it suggests a way to scale high-quality AI feedback without needing domain experts to write new rubrics for every task.

Core claim

An LP-driven rubric generation pipeline produced AI feedback on students' scientific explanations that was statistically indistinguishable from feedback guided by a human expert-designed task-specific rubric, as measured across multiple quality dimensions by human coders.

What carries the argument

The learning progression-driven rubric pipeline, which automatically derives task-specific rubrics from a learning progression prior to generating feedback.
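The two-stage shape of that pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `LPLevel`, `rubric_prompt`, `feedback_prompt`, `lp_feedback`, the prompt wording, and the `llm` callable are all hypothetical names introduced here. Only the ordering (rubric derived from the learning progression before grading, then evaluation and feedback per response) follows the paper's description.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class LPLevel:
    """One level of a learning progression (hypothetical structure)."""
    level: int
    descriptor: str  # what a response at this level typically shows


def rubric_prompt(levels: List[LPLevel], task: str) -> str:
    """Build a prompt asking the model to turn LP levels into task-specific criteria."""
    ordered = sorted(levels, key=lambda lv: lv.level)
    lines = [f"Level {lv.level}: {lv.descriptor}" for lv in ordered]
    return (
        "Derive a task-specific scoring rubric from this learning progression.\n"
        f"Task: {task}\n" + "\n".join(lines)
    )


def feedback_prompt(rubric: str, response: str) -> str:
    """Build a prompt for (a) evaluating a response and (b) writing aligned feedback."""
    return (
        "Using the rubric, (a) evaluate the response and "
        "(b) write formative feedback aligned with that evaluation.\n"
        f"Rubric:\n{rubric}\nStudent response: {response}"
    )


def lp_feedback(
    levels: List[LPLevel],
    task: str,
    responses: List[str],
    llm: Callable[[str], str],  # assumed: any function mapping prompt text to model text
) -> List[str]:
    """Derive the rubric once, then generate feedback for each student response."""
    rubric = llm(rubric_prompt(levels, task))  # stage 1: happens before any grading
    return [llm(feedback_prompt(rubric, r)) for r in responses]  # stage 2
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider and makes the ordering easy to check with a stub. The expert-rubric pipeline would differ only in skipping stage 1 and supplying a hand-written rubric.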

If this is right

  • The LP approach can serve as a scalable alternative to expert rubric authoring for AI feedback systems.
  • Feedback quality remains comparable in clarity, relevance, engagement, and reflectiveness.
  • This method may reduce the time and expertise required to create feedback rubrics for new instructional contexts.
  • Human coders can rate feedback from both pipelines with high inter-rater reliability (89% to 100% agreement; kappa = .66 to .88 where estimable).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could enable faster adaptation of AI tutoring systems across different science topics without repeated expert input.
  • Future work might test whether the same LP-derived rubrics improve actual student learning outcomes, not just feedback quality ratings.
  • The approach may generalize to other subjects if learning progressions are available.

Load-bearing premise

The multi-dimensional rubric used to evaluate the feedback fully captures what makes feedback educationally valuable, and the learning progression accurately maps student understanding in the chemistry task.

What would settle it

A study where students receiving LP-guided feedback show significantly different learning gains compared to those receiving expert-rubric feedback, as measured by pre-post tests on the chemistry concepts.

Figures

Figures reproduced from arXiv: 2603.03249 by Xin Xia (1), Nejla Yuruk (2), Yun Wang (1), Xiaoming Zhai (1) ((1) University of Georgia, (2) Gazi University).

Figure 3. [image not reproduced]
Figure 2. Learning Progression for Evidence-based Explanations [image not reproduced]
Original abstract

Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts. While effective, rubric authoring is time-consuming and limits scalability across instructional contexts. Learning progressions (LP) provide a theoretically grounded representation of students' developing understanding and may offer an alternative solution. This study examines whether an LP-driven rubric generation pipeline can produce AI-generated feedback comparable in quality to feedback guided by expert-authored task rubrics. We analyzed AI-generated feedback for written scientific explanations produced by 207 middle school students in a chemistry task. Two pipelines were compared: (a) feedback guided by a human expert-designed, task-specific rubric, and (b) feedback guided by a task-specific rubric automatically derived from a learning progression prior to grading and feedback generation. Two human coders evaluated feedback quality using a multi-dimensional rubric assessing Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness (10 sub-dimensions). Inter-rater reliability was high, with percent agreement ranging from 89% to 100% and Cohen's kappa values for estimable dimensions (kappa = .66 to .88). Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity (t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = .399), Relevance (t1 = 0.28, p1 = .782; t2 = -0.58, p2 = .565), Engagement and Motivation (t1 = 0.50, p1 = .618; t2 = -0.58, p2 = .565), or Reflectiveness (t = -0.45, p = .656). These findings suggest that the LP-driven rubric pipeline can serve as an alternative solution.
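The two reliability statistics named in the abstract can be illustrated with a short sketch. It is an illustration only: the function names are introduced here, it assumes each coder's ratings for one sub-dimension are stored as equal-length lists, and it makes no claim about the software the authors used. Returning `None` when chance agreement equals 1 shows one way kappa can fail to be estimable, for example when both coders assign the same score to every item.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of items on which the two coders gave the same rating."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> Optional[float]:
    """Cohen's kappa for two coders; None when chance agreement is 1 (not estimable)."""
    n = len(a)
    p_o = percent_agreement(a, b)  # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the coders' marginal proportions, summed over categories.
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return None  # both coders used a single identical category throughout
    return (p_o - p_e) / (1.0 - p_e)
```

Because kappa corrects for chance, it can be much lower than percent agreement when nearly all ratings fall in one category, which is a reason to report both figures.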

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that an LP-driven rubric generation pipeline produces AI feedback for middle-school chemistry explanations comparable in quality to expert-authored rubric feedback, as evidenced by no statistically significant differences in paired t-tests on human-coded dimensions (Clarity, Relevance, Engagement/Motivation, Reflectiveness) for 207 student responses, with high inter-rater reliability.

Significance. If the central comparison holds after addressing gaps in validation and outcome measures, the work would demonstrate a scalable, theory-grounded alternative to expert rubric authoring for AI formative feedback in science education, potentially reducing development costs while preserving quality.

major comments (2)
  1. [Abstract] Abstract: the claim that the LP pipeline 'can serve as an alternative solution' rests on coder ratings alone; no pre/post learning gains, explanation improvement scores, or student outcome data are reported for the 207 explanations, leaving the educational effectiveness of the equivalence untested.
  2. [Abstract] Abstract and methods description: the automatic derivation of the task-specific rubric from the LP is described at high level only, with no reported validation that the resulting dimensions align with actual student misconception trajectories or empirical data in this chemistry domain.
minor comments (1)
  1. [Abstract] Abstract: provide the exact sample selection criteria and full statistical reporting (e.g., effect sizes, confidence intervals) beyond the listed t and p values to allow independent verification.
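The extra statistics requested in this minor comment can be sketched for a paired design as follows. This is a hedged illustration, not the authors' analysis: `paired_summary` is a name introduced here, the effect size shown is the paired-samples d_z (mean difference divided by the standard deviation of the differences), and the confidence interval uses a normal quantile as a large-sample approximation. An exact interval would use the t quantile with n - 1 degrees of freedom.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Dict, Sequence


def paired_summary(x: Sequence[float], y: Sequence[float], level: float = 0.95) -> Dict:
    """Paired t statistic, d_z effect size, and an approximate CI for the mean difference."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    m = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        # No variation in the differences: t and d_z are undefined.
        return {"n": n, "df": n - 1, "mean_diff": m, "t": None, "d_z": None, "ci": None}
    se = sd / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation (assumption)
    return {
        "n": n,
        "df": n - 1,
        "mean_diff": m,
        "t": m / se,
        "d_z": m / sd,
        "ci": (m - z * se, m + z * se),
    }
```

The undefined case matters for this paper's design: if every response receives the same score under both pipelines, the differences have zero variance and no t value can be computed.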

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below, clarifying the study's scope while planning targeted revisions to improve transparency and detail.

  1. Referee: [Abstract] Abstract: the claim that the LP pipeline 'can serve as an alternative solution' rests on coder ratings alone; no pre/post learning gains, explanation improvement scores, or student outcome data are reported for the 207 explanations, leaving the educational effectiveness of the equivalence untested.

    Authors: We agree that the study evaluates feedback quality via human coder ratings rather than direct student learning outcomes. The primary objective was to test equivalence in feedback quality between the two pipelines as a necessary first step before outcome studies. No pre/post gains or explanation improvement scores were collected. In revision we will update the abstract and add an explicit limitations paragraph in the discussion that states this scope limitation and outlines planned future work measuring educational effectiveness through student outcome data. revision: partial

  2. Referee: [Abstract] Abstract and methods description: the automatic derivation of the task-specific rubric from the LP is described at high level only, with no reported validation that the resulting dimensions align with actual student misconception trajectories or empirical data in this chemistry domain.

    Authors: The methods section describes the pipeline at a summary level. We will expand it with a step-by-step account of how LP levels are mapped to rubric dimensions and criteria. The underlying LP draws on prior empirical work in chemistry, yet the current study does not include new domain-specific validation of the derived dimensions against misconception trajectories. We will add a brief discussion of this point and note it as an area for future empirical validation. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparison with external human ratings

The paper reports an empirical study that generates AI feedback via two pipelines (expert rubric vs. LP-derived rubric), collects human coder ratings on multiple dimensions, computes inter-rater reliability, and performs paired t-tests showing no significant differences. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the derivation chain. The central claim rests on external human evaluation data and standard statistical tests rather than any reduction of outputs to inputs by construction. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim depends on the validity of the chosen learning progression as a representation of student understanding and on the assumption that the five evaluation dimensions adequately measure feedback quality. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Learning progressions provide a valid, task-specific representation of developing student understanding that can be converted into rubrics.
    Invoked when deriving the automatic rubric from the LP prior to grading.
  • domain assumption The multi-dimensional human evaluation rubric measures the educational effectiveness of feedback.
    Used to conclude equivalence between pipelines.

pith-pipeline@v0.9.0 · 5667 in / 1275 out tokens · 33122 ms · 2026-05-15T16:34:35.908686+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.RealityFromDistinction reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity... These findings suggest that the LP-driven rubric pipeline can serve as an alternative solution.

  • IndisputableMonolith.Cost Jcost unclear

    Relation between the paper passage and the cited Recognition theorem.

    Learning progressions (LP) provide a theoretically grounded representation of students’ developing understanding

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

  1. [1]

    Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts

    Using Learning Progressions to Guide AI Feedback for Science Learning. Xin Xia (1) [0009-0009-1717-8511], Nejla Yuruk (2) [0000-0001-9240-750X], Yun Wang (1) [0009-0004-6611-0752], and Xiaoming Zhai (1) [0000-0003-4519-1931]. (1) University of Georgia, Athens GA 30605, USA; (2) Gazi University, Ankara, 06560, Turkiye. xx86245@uga.com. Abstract. Generative artificial intelligence ...

  2. [2]

    describes an LLM-assisted feedback tool in which instructor-defined criteria are used to guide automated responses to open-ended questions, emphasizing that criteria-based scaffolding supports more targeted and useful feedback. At the same time, empirical comparisons caution that generative AI feedback may fall short of human feedback on criteria-based q...

  3. [3]

    For each student response, the model was prompted to (a) evaluate the response according to the provided rubric and (b) generate formative feedback aligned with the evaluation

    [Figure 2 caption] Learning Progression for Evidence-based Explanations. 3.3 AI Feedback Generation. Feedback in both pipelines was generated using the same large language model (GPT-5.1). For each student response, the model was prompted to (a) evaluate the response according to the provided rubric and (b) generate formative feedback aligned with the evaluation. Feedback was...

  4. [4]

    Feedback Evaluation T-test.

    Sub-Dimension                Rubric M (SD)   LP M (SD)     t      df    p
    Clarity – Language           2.00 (0.07)     2.00 (0.07)   0.00   206   1.000
    Clarity – Structure          1.87 (0.33)     1.85 (0.36)   0.84   206   0.399
    Accuracy – Correctness       2.00 (0.00)     2.00 (0.00)   —      206   —
    Accuracy – Terminology       2.00 (0.00)     2.00 (0.00)   —      206   —
    Relevance – Responsiveness   1.97 (0.18)     1.96 (0.19)   0.28   2...

  5. [5]

    McNeill, K.L., Krajcik, J.S.: Supporting Grade 5-8 Students in Constructing Explanations in Science: The Claim, Evidence, and Reasoning Framework for Talk and Writing. Pearson (2011)

  6. [6]

    Shute, V.J.: Focus on Formative Feedback. Rev. Educ. Res. 78, 153–189 (2008). https://doi.org/10.3102/0034654307313795

  7. [7]

    Hattie, J., Timperley, H.: The Power of Feedback. Rev. Educ. Res. 77, 81–112 (2007). https://doi.org/10.3102/003465430298487

  8. [8]

    Abar, R.O., Pong, M., Som, R.: AI-Driven Feedback Systems for Formative Assessment: Toward Personalized and Real-Time Pedagogy. Al-Hijr J. Adulearn World. 4, 87–100 (2025). https://doi.org/10.55849/alhijr.v4i2.984

  9. [9]

    Deepshikha, D.: A systematic review on the future of educational assessment: AI-driven grading and personalised feedback in higher education. Artif. Intell. Educ. 1–41 (2025). https://doi.org/10.1108/AIIE-03-2025-0036

  10. [10]

    Bulut, O., Wongvorachan, T.: Feedback Generation through Artificial Intelligence. Open-Technology Educ. Soc. Scholarsh. Assoc. Conf. 2, 1–9 (2022). https://doi.org/10.18357/otessac.2022.2.1.125

  11. [11]

    VanLehn, K.: The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems. Educ. Psychol. 46, 197–221 (2011). https://doi.org/10.1080/00461520.2011.611369

  12. [12]

    Lai, P., Lau, I., Pang, R.: Exploring the Efficacy of Rubric-Based AI Feedback in Enhancing Student Writing Outcomes. In: 2024 6th International Workshop on Artificial Intelligence and Education (WAIE). pp. 220–224 (2024). https://doi.org/10.1109/WAIE63876.2024.00047

  13. [13]

    Pan, Y.: Leveraging generative AI powered rubric-indexed feedback as a formative assessment strategy for enhancing medical English education. Discov. Comput. 28, 284 (2025). https://doi.org/10.1007/s10791-025-09830-9

  14. [14]

    Brookhart, S.M.: Appropriate Criteria: Key to Effective Rubrics. Front. Educ. 3, (2018). https://doi.org/10.3389/feduc.2018.00022

  15. [15]

    Panadero, E., Andrade, H., Brookhart, S.: Fusing self-regulated learning and formative assessment: a roadmap of where we are, how we got here, and where we are going. Aust. Educ. Res. 45, 13–31 (2018). https://doi.org/10.1007/s13384-018-0258-y

  16. [16]

    Alonzo, A.C., Steedle, J.T.: Developing and assessing a force and motion learning progression. Sci. Educ. 93, 389–421 (2009). https://doi.org/10.1002/sce.20303

  17. [17]

    Duncan, R.G., Hmelo-Silver, C.E.: Learning progressions: Aligning curriculum, instruction, and assessment. J. Res. Sci. Teach. 46, 606–609 (2009). https://doi.org/10.1002/tea.20316

  18. [18]

    Bell, B., Cowie, B.: The characteristics of formative assessment in science education. Sci. Educ. 85, 536–553 (2001). https://doi.org/10.1002/sce.1022

  19. [19]

    Black, P.: Assessment and feedback in science education. Stud. Educ. Eval. 21, 257–279 (1995). https://doi.org/10.1016/0191-491X(95)00015-M

  20. [20]

    Koedinger, K.R., D’Mello, S., McLaughlin, E.A., Pardos, Z.A., Rosé, C.P.: Data mining and education. WIREs Cogn. Sci. 6, 333–353 (2015). https://doi.org/10.1002/wcs.1350

  21. [21]

    Lee, H.-S., Pallant, A., Pryputniewicz, S., Lord, T., Mulholland, M., Liu, O.L.: Automated text scoring and real-time adjustable feedback: Supporting revision of scientific arguments involving uncertainty. Sci. Educ. 103, 590–622 (2019). https://doi.org/10.1002/sce.21504

  22. [22]

    Jescovitch, L.N., Scott, E.E., Cerchiara, J.A., Merrill, J., Urban-Lurain, M., Doherty, J.H., Haudek, K.C.: Comparison of Machine Learning Performance Using Analytic and Holistic Coding Approaches Across Constructed Response Assessments Aligned to a Science Learning Progression. J. Sci. Educ. Technol. 30, 150–167 (2021). https://doi.org/10.1007/s10956-02...

  23. [23]

    Kaldaras, L., Haudek, K., Krajcik, J.: Employing automatic analysis tools aligned to learning progressions to assess knowledge application and support learning in STEM. Int. J. STEM Educ. 11, 57 (2024). https://doi.org/10.1186/s40594-024-00516-0

  24. [24]

    Jukiewicz, M., Wyrwa, M.: Can ChatGPT Replace the Teacher in Assessment? A Review of Research on the Use of Large Language Models in Grading and Providing Feedback. Appl. Sci. 16, (2026). https://doi.org/10.3390/app16020680

  25. [25]

    Steiss, J., Tate, T., Graham, S., Cruz, J., Hebert, M., Wang, J., Moon, Y., Tseng, W., Warschauer, M., Olson, C.B.: Comparing the quality of human and ChatGPT feedback of students’ writing. Learn. Instr. 91, 101894 (2024). https://doi.org/10.1016/j.learninstruc.2024.101894

  26. [26]

    Liu, O.L., Brew, C., Blackmore, J., Gerard, L., Madhok, J., Linn, M.C.: Automated Scoring of Constructed-Response Science Items: Prospects and Obstacles. Educ. Meas. Issues Pract. 33, 19–28 (2014). https://doi.org/10.1111/emip.12028

  27. [27]

    Matelsky, J.K., Parodi, F., Liu, T., Lange, R.D., Kording, K.P.: A large language model-assisted education tool to provide feedback on open-ended responses, http://arxiv.org/abs/2308.02439, (2023). https://doi.org/10.48550/arXiv.2308.02439

  28. [28]

    National Research Council: A Framework for K-12 Science Education: Practices, Crosscutting Concepts, and Core Ideas. National Academies Press (2012)

  29. [29]

    Wilson, M.: Measuring progressions: Assessment structures underlying a learning progression. J. Res. Sci. Teach. 46, 716–730 (2009). https://doi.org/10.1002/tea.20318

  30. [30]

    Jin, H., Lima, C., Wang, L.: Automated Scoring in Learning Progression-Based Assessment: A Comparison of Researcher and Machine Interpretations. Educ. Meas. Issues Pract. 44, 25–37 (2025). https://doi.org/10.1111/emip.70003

  31. [31]

    Kaldaras, L., Haudek, K.C.: Validation of automated scoring for learning progression-aligned Next Generation Science Standards performance assessments. Front. Educ. 7, (2022). https://doi.org/10.3389/feduc.2022.968289

  32. [32]

    Harris, C.J., Krajcik, J.S., Pellegrino, J.W.: Creating and using instructionally supportive assessments in NGSS classrooms. NSTA Press, National Science Teaching Association (2024)

  33. [33]

    Zhai, X., He, P., Krajcik, J.: Applying machine learning to automatically assess scientific models. J. Res. Sci. Teach. 59, 1765–1794 (2022). https://doi.org/10.1002/tea.21773

  34. [34]

    Gotwals, A.W., Songer, N.B.: Validity Evidence for Learning Progression-Based Assessment Items That Fuse Core Disciplinary Ideas and Science Practices. J. Res. Sci. Teach. 50, 597–626 (2013). https://doi.org/10.1002/tea.21083

  35. [35]

    He, P., Shin, N., Zhai, X., Krajcik, J.: A Design Framework for Integrating Artificial Intelligence to Support Teachers’ Timely Use of Knowledge-in-Use Assessments. (2023). https://doi.org/10.13140/RG.2.2.19088.58881

  36. [36]

    Watts, F.M., Liu, L., Ober, T.M., Song, Y., Valle, E.J.-D., Zhai, X., Wang, Y., Liu, N.: A Framework for Designing an AI Chatbot to Support Scientific Argumentation. Educ. Sci. 15, (2025). https://doi.org/10.3390/educsci15111507

  37. [37]

    Cohen, J.: A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 20, 37–46 (1960). https://doi.org/10.1177/001316446002000104

  38. [38]

    McHugh, M.L.: Interrater reliability: the kappa statistic. Biochem. Medica. 22, 276–282 (2012)

  39. [39]

    Li, M., Gao, Q., Yu, T.: Kappa statistic considerations in evaluating inter-rater reliability between two raters: which, when and context matters. BMC Cancer. 23, 799 (2023). https://doi.org/10.1186/s12885-023-11325-z

  40. [40]

    Zhu, M., Liu, O.L., Lee, H.-S.: The effect of automated feedback on revision behavior and learning gains in formative assessment of scientific argument writing. Comput. Educ. 143, 103668 (2020). https://doi.org/10.1016/j.compedu.2019.103668

  41. [41]

    Nazaretsky, T., Gabbay, H., Käser, T.: Can students judge like experts? A large-scale study on the pedagogical quality of AI and human personalized formative feedback. Comput. Educ. Artif. Intell. 10, 100533 (2026). https://doi.org/10.1016/j.caeai.2025.100533

  42. [42]

    Zhang, D.-W., Boey, M., Tan, Y.Y., Jia, A.H.S.: Evaluating large language models for criterion-based grading from agreement to consistency. NPJ Sci. Learn. 9, 79 (2024). https://doi.org/10.1038/s41539-024-00291-1