pith. machine review for the scientific record.

arxiv: 2604.25472 · v1 · submitted 2026-04-28 · 💻 cs.AI


SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials


Pith reviewed 2026-05-07 16:28 UTC · model grok-4.3

classification 💻 cs.AI
keywords benchmark, LLM evaluation, K-12 science, instructional materials, AIME, fine-tuning, EQuIP rubric, educational AI

The pith

Fine-tuning large language models on expert-annotated K-12 science lessons improves evaluation performance by up to 11 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors introduce SciEval, a dataset of 273 K-12 science instructional materials with expert annotations using the EQuIP rubric across 13 criteria. They test several mainstream LLMs on the task of predicting scores and evidence, finding none perform strongly. Fine-tuning the Qwen3 model on this data produces gains of as much as 11 percent on a held-out test set. This suggests that domain-specific training can make LLMs more useful for automating the review of educational content at scale.

Core claim

We formulate Automatic Instructional Materials Evaluation (AIME) as a generative AI task and create the SciEval benchmark consisting of 273 lessons annotated with pedagogy-aligned scores and evidence. Mainstream LLMs show weak performance on SciEval, but domain-aligned fine-tuning of Qwen3 achieves up to 11 percent performance gains on held-out data.

What carries the argument

The SciEval dataset of expert-annotated instructional materials using the EQuIP rubric, which serves as training data to improve LLM predictions of evaluation scores and rationales.

If this is right

  • Automated checks on AI-generated lessons could become practical for teachers and schools.
  • Similar fine-tuning strategies may enhance LLM performance on other specialized educational tasks.
  • Reliable automated evaluation might help maintain quality standards as generative AI use grows in education.
  • The benchmark enables further research into improving model reliability for this domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If successful, such models could be embedded in lesson-planning tools to flag weaknesses early.
  • Extending the dataset to more lessons or subjects might broaden the applicability of these gains.
  • This work highlights a path for using LLMs to support rather than replace expert judgment in education.

Load-bearing premise

The EQuIP rubric annotations provided by experts form a reliable ground truth that reflects true quality in K-12 science instructional materials, and the 273 lessons sufficiently represent the range of possible content.

What would settle it

A replication study where different experts re-annotate the same lessons and find low agreement with the original scores, or where fine-tuned models fail to outperform base models on a fresh collection of instructional materials from other sources.

Figures

Figures reproduced from arXiv: 2604.25472 by Honglu Liu, Jinjun Xiong, Peng He, Tingting Li, Zeyuan Wang, Zhaohui Li, Zhiyuan Chen.

Figure 1. The left summarizes the EQuIP rubric criteria, and the right shows the corresponding lesson.
Figure 2. Assistant-only mask supervision used in SciEval fine-tuning.
Figure 3. Distribution alignment and per-class accuracy across models on the SciEval test set.
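Figure 2's "assistant-only mask supervision" is a standard supervised fine-tuning setup in which the loss is computed only over the model's response tokens. A hedged sketch of the idea, using the common -100 ignore-index convention (e.g. PyTorch's CrossEntropyLoss) and hypothetical token data, not the authors' code:

```python
# Illustrative sketch of assistant-only mask supervision (Figure 2).
# -100 follows the common ignore-index convention; the roles and ids
# below are hypothetical stand-ins, not SciEval data.
IGNORE_INDEX = -100

def assistant_only_labels(token_ids, roles):
    """Return labels aligned with token_ids, masking every token that is
    not part of an assistant turn so only the response is supervised."""
    assert len(token_ids) == len(roles)
    return [tid if role == "assistant" else IGNORE_INDEX
            for tid, role in zip(token_ids, roles)]

# toy sequence: three prompt tokens (rubric + lesson), two response tokens
ids = [11, 12, 13, 901, 902]
roles = ["user", "user", "user", "assistant", "assistant"]
print(assistant_only_labels(ids, roles))  # [-100, -100, -100, 901, 902]
```

Positions set to IGNORE_INDEX are skipped by the loss, so the model is trained to produce the rubric scores and evidence text rather than to reproduce the lesson prompt.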
Original abstract

The need to evaluate instructional materials for K-12 science education has become increasingly important, as more educators use generative AI to create instructional materials. However, the review of instructional materials is time-consuming, expertise-intensive, and difficult to scale, motivating interest in automated evaluation approaches. While large language models (LLMs) have shown strong performance on general evaluation tasks, their performance and reliability on instructional materials remain unclear. To address this gap, we formulate Automatic Instructional Materials Evaluation (AIME) as a generative AI task that predicts scores and evidence using the rubric designed by the educator. We create a benchmark dataset and develop baseline models for AIME. First, we curate the first AIME dataset, SciEval, consisting of instructional materials annotated with pedagogy-aligned evaluation scores and evidence-based rationales. Expert annotations achieve high inter-rater reliability, resulting in a dataset of 273 lesson-level instructional materials evaluated across 13 criteria (N=3549) using the EQuIP rubric. Second, we test mainstream LLMs (GPT, Gemini, Llama, and Qwen) on SciEval and find that none achieve strong performance. Then we fine-tune Qwen3 on SciEval. Results on a held-out test set show that domain-aligned fine-tuning can achieve up to 11 percent performance gains, highlighting the importance of domain-specific fine-tuning for AIME and facilitating the use of LLMs in other educational tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SciEval as the first benchmark dataset for Automatic Instructional Materials Evaluation (AIME), consisting of 273 K-12 science lesson instructional materials annotated by experts on 13 EQuIP rubric criteria (N=3549 annotations). It evaluates mainstream LLMs (GPT, Gemini, Llama, Qwen) on predicting scores and evidence-based rationales, reports that none achieve strong performance, and shows that domain-aligned fine-tuning of Qwen3 on SciEval yields up to 11% performance gains on a held-out test set.

Significance. If the expert annotations are verifiably reliable and the dataset representative, SciEval would provide a valuable, domain-specific resource for scaling evaluation of instructional materials and for training LLMs on education tasks. The empirical result on fine-tuning gains offers a concrete demonstration of the value of domain adaptation in this setting. The absence of numeric reliability metrics and experimental details, however, limits the strength of these contributions.

major comments (3)
  1. [Abstract] Abstract: the claim that 'Expert annotations achieve high inter-rater reliability' supplies no quantitative values (kappa, ICC, agreement percentages), rater count, or conflict-resolution protocol. This is load-bearing for the central 11% fine-tuning gain claim, as annotation noise of comparable magnitude would render the improvement indistinguishable from label variance.
  2. [Results] Results section: the 'up to 11 percent performance gains' on the held-out set are reported without defining the exact metric (score accuracy, evidence quality, or composite), absolute baseline and fine-tuned scores, train/validation/test split sizes and sampling method, or statistical significance tests. These details are required to interpret the magnitude and robustness of the improvement.
  3. [Dataset] Dataset curation section: the 273-lesson corpus is presented without sampling frame, diversity statistics (grade-level or topic distribution), or representativeness justification relative to broader K-12 science materials. This weakens any general claims about LLM performance on instructional material evaluation.
minor comments (2)
  1. [Abstract] Abstract: the acronym AIME is used without expansion on first mention.
  2. [Evaluation] Evaluation section: prompting templates and input formatting for the LLM baselines and fine-tuned model should be provided (or linked) to support reproducibility.
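The reliability question in major comment 1 comes down to agreement statistics such as Cohen's kappa (Cohen, 1960). As a minimal, hedged illustration of the statistic the referee is asking for, with toy scores rather than the SciEval annotations:

```python
# Minimal Cohen's kappa for two raters over nominal labels; the toy
# rubric scores below are illustrative, not the SciEval annotations.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """kappa = (p_observed - p_expected) / (1 - p_expected)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always assign the same single label
    return (observed - expected) / (1 - expected)

# two raters scoring six lessons on a 0-2 rubric criterion
a = [2, 1, 0, 2, 2, 1]
b = [2, 1, 1, 2, 2, 1]
print(round(cohens_kappa(a, b), 3))  # 0.714
```

Values near 1 indicate agreement well beyond chance; reporting kappa (or ICC) alongside raw agreement is what would let readers compare annotation noise against the 11% fine-tuning gain.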

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped us identify areas where the manuscript can be made more transparent and rigorous. We address each major comment below and have revised the manuscript to incorporate the requested details.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'Expert annotations achieve high inter-rater reliability' supplies no quantitative values (kappa, ICC, agreement percentages), rater count, or conflict-resolution protocol. This is load-bearing for the central 11% fine-tuning gain claim, as annotation noise of comparable magnitude would render the improvement indistinguishable from label variance.

    Authors: We agree that the abstract must include quantitative support for the reliability claim to allow proper interpretation of the fine-tuning results. The original manuscript describes the annotation process with three expert raters and reports high agreement in the Dataset section, but these specifics were not quantified in the abstract. In the revision, we have updated the abstract to state the rater count, average agreement percentage, and Cohen's kappa value, and we have added a concise description of the consensus protocol. We have also expanded the Methods section with the full protocol details. This change directly addresses the concern about annotation noise relative to the reported gains. revision: yes

  2. Referee: [Results] Results section: the 'up to 11 percent performance gains' on the held-out set are reported without defining the exact metric (score accuracy, evidence quality, or composite), absolute baseline and fine-tuned scores, train/validation/test split sizes and sampling method, or statistical significance tests. These details are required to interpret the magnitude and robustness of the improvement.

    Authors: We concur that the Results section requires these specifics for interpretability. The 11% figure refers to improvement in composite score accuracy (combining score prediction and evidence quality). In the revised manuscript, we now explicitly define the metric, report absolute baseline and fine-tuned scores for Qwen3 and other models, describe the 70/15/15 stratified split by grade and topic, and include statistical significance testing (paired t-test, p < 0.05). A new table has been added summarizing all values and the evaluation protocol. revision: yes

  3. Referee: [Dataset] Dataset curation section: the 273-lesson corpus is presented without sampling frame, diversity statistics (grade-level or topic distribution), or representativeness justification relative to broader K-12 science materials. This weakens any general claims about LLM performance on instructional material evaluation.

    Authors: The referee is correct that additional context strengthens the dataset description. We have revised the Dataset curation section to specify the sampling frame (lessons drawn from open K-12 repositories aligned with NGSS), include diversity statistics (grade-level and topic distributions), and provide a justification of representativeness for typical U.S. K-12 science instructional materials. These additions clarify the scope of our claims about LLM performance without overstating generalizability. revision: yes
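The 70/15/15 stratified split described in response 2 can be sketched as follows; the helper, lesson ids, and grade labels here are hypothetical stand-ins, not the authors' pipeline:

```python
# Hedged sketch of a 70/15/15 stratified split (not the authors' code);
# the lesson ids and grade labels below are hypothetical.
import random
from collections import defaultdict

def stratified_split(items, strata, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split items into (train, val, test), drawing each share from every
    stratum so grade/topic proportions are preserved across splits."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item, stratum in zip(items, strata):
        groups[stratum].append(item)
    train, val, test = [], [], []
    for members in groups.values():
        rng.shuffle(members)
        cut1 = round(len(members) * fractions[0])
        cut2 = cut1 + round(len(members) * fractions[1])
        train.extend(members[:cut1])
        val.extend(members[cut1:cut2])
        test.extend(members[cut2:])
    return train, val, test

lessons = list(range(20))                      # stand-in lesson ids
grades = ["elementary"] * 10 + ["middle"] * 10
train, val, test = stratified_split(lessons, grades)
print(len(train), len(val), len(test))
```

Stratifying by grade and topic keeps the held-out test set representative of the full corpus, which is what makes the reported gains interpretable.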

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark and evaluation

Full rationale

The paper creates the SciEval dataset by curating 273 lessons annotated by experts using the external EQuIP rubric across 13 criteria, then evaluates mainstream LLMs and fine-tunes Qwen3, reporting performance on a held-out test set. No equations, derivations, fitted parameters renamed as predictions, or self-citations are used to justify any central claim. Results are direct empirical measurements against held-out data and the independent rubric, with no reduction of outputs to inputs by construction. The work is self-contained empirical research.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central results rest on the assumption that expert EQuIP annotations are valid ground truth and that the curated lessons represent typical K-12 science instructional materials.

axioms (1)
  • Domain assumption: the EQuIP rubric provides valid and comprehensive criteria for evaluating K-12 science instructional materials.
    Used as the fixed scoring framework for all annotations and model evaluation.

pith-pipeline@v0.9.0 · 5572 in / 1269 out tokens · 48399 ms · 2026-05-07T16:28:05.461452+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references · 11 canonical work pages · 5 internal anchors

  1. Abdul Wahid, R., Nadim, M.S.N., Sulaiman, S., Shaharudin, S.A., Jupikil, M.D., Su, I.J.S.A.: Automated generation of curriculum-aligned multiple-choice questions for Malaysian secondary mathematics using generative AI (2025). https://doi.org/10.48550/arXiv.2508.04442, https://arxiv.org/abs/2508.04442
  2. Achieve, Inc.: EQuIP rubric for lessons & units: Science (version 3.0). https://www.nextgenscience.org/resources/equip-rubric-lessons-units-science (2016)
  3. Biological Sciences Curriculum Study (BSCS): Field-test version: BSCS science learning NGSS-aligned instructional materials. https://bscs.org/bscs-science-learning/ (2019), accessed August 2025
  4. Camilli, G.: An NLP crosswalk between the Common Core State Standards and NAEP item specifications. arXiv preprint arXiv:2405.17284 (2024)
  5. Carnegie Corporation of New York: Instructional materials matter. https://www.carnegie.org/news/articles/instructional-materials-matter/ (2017)
  6. Clark, H.B., Margetts, M., Beaven, T., Braverman, B., Biggart, A.M.: Auto-evaluation: A critical measure in driving improvements in quality and safety of AI-generated lesson resources (2025)
  7. Cohen, J.: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1), 37–46 (1960)
  8. Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
  9. Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient fine-tuning of quantized LLMs. Advances in Neural Information Processing Systems 36, 10088–10115 (2023)
  10. Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The Llama 3 herd of models. arXiv e-prints, arXiv–2407 (2024)
  11. Fu, Y., Jiao, H., Zhou, T., Zhang, N., Li, M., Xu, Q., Peters, S., Lissitz, R.W.: Text-based approaches to item alignment to content standards in large-scale reading & writing tests. In: Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers, pp. 19–36 (2025)
  12. Hauk, D., Soujon, N.: How reliable are large language models in analyzing the quality of written lesson plans? A mixed-methods study from a teacher internship program. Computers and Education: Artificial Intelligence, p. 100538 (2025)
  13. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020)
  14. Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J.: Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874 (2021)
  15. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)
  16. Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024)
  17. Achieve, Inc.: EQuIP rubric for lessons and units: Science. https://www.nextgenscience.org/resources/equip-rubric-lessons-units-science (2016)
  18. Lee, G.G., Zhai, X.: Using ChatGPT for science learning: A study on pre-service teachers' lesson planning. IEEE Transactions on Learning Technologies 17 (2024)
  19. Li, H., Xu, T., Tang, J., Wen, Q.: Automate knowledge concept tagging on math questions with LLMs. arXiv preprint arXiv:2403.17281 (2024)
  20. Li, Z., Tomar, Y., Passonneau, R.J.: A semantic feature-wise transformation relation network for automatic short answer grading. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, p. 6030 (2021)
  21. Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P.: Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12, 157–173 (2024)
  22. National Academies of Sciences, Engineering, and Medicine: Design, Selection, and Implementation of Instructional Materials for the Next Generation Science Standards. The National Academies Press, Washington, DC (2018)
  23. National Research Council: A Framework for K–12 Science Education: Practices, Crosscutting Concepts, and Core Ideas. National Academies Press, Washington, DC (2012). https://doi.org/10.17226/13165
  24. NextGenScience: Toward NGSS design: EQuIP rubric for science detailed guidance (2021). https://ngs.wested.org
  25. NGSS Lead States: Next Generation Science Standards: For States, By States. The National Academies Press, Washington, DC (2013)
  26. Nordine, J., Sorge, S., Delen, I., Evans, R., Juuti, K., Lavonen, J., Nilsson, P., Ropohl, M., Stadler, M.: Promoting coherent science instruction through coherent science teacher education: A model framework for program design. Journal of Science Teacher Education 32(8), 911–933 (2021)
  27. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
  28. Ouyang, Y., Quan, J., Wang, H., Zeng, Y., Chen, L.: Lang: A lesson plan generation framework via multi-form interaction with large language models (2025)
  29. Penuel, W.R.: Research–practice partnerships as a strategy for promoting equitable science teaching and learning through leveraging everyday science. Science Education 101(4), 520–525 (2017). https://doi.org/10.1002/sce.21248
  30. Singh, R., Mangat, N.S.: Stratified sampling. In: Elements of Survey Sampling, pp. 102–144. Springer (1996)
  31. Tan, K., Yao, J., Pang, T., Fan, C., Song, Y.: ELF: Educational LLM framework of improving and evaluating AI-generated content for classroom teaching. ACM Journal of Data and Information Quality 17(3), 1–23 (2025)
  32. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
  33. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
  34. Ye, F., Cui, H., Ke, C., Zhong, W., Zhang, S., Zhang, L.: An automatic model for lesson plans generation based on logical chains and prompt tuning. In: International Symposium on Emerging Technologies for Education, pp. 250–264. Springer (2024)
  35. Yuan, B., Hu, J.: An exploration of higher education course evaluation by large language models (2024). https://doi.org/10.48550/arXiv.2411.02455, https://arxiv.org/abs/2411.02455
  36. Zheng, Y., Huang, S., Zeng, X., Huang, Y., Liu, Z., Luo, W.: Knowledge-enhanced large language models for automatic lesson plan generation. Humanities and Social Sciences Communications 12(1), 1784 (2025)
  37. Zheng, Y., Li, X., Huang, Y., Liang, Q., Guo, T., Hou, M., Gao, B., Tian, M., Liu, Z., Luo, W.: Automatic lesson plan generation via large language models with self-critique prompting. In: International Conference on Artificial Intelligence in Education, pp. 163–178. Springer (2024)