SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials
Pith reviewed 2026-05-07 16:28 UTC · model grok-4.3
The pith
Fine-tuning large language models on expert-annotated K-12 science lessons improves evaluation performance by up to 11 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formulate Automatic Instructional Materials Evaluation (AIME) as a generative AI task and create the SciEval benchmark consisting of 273 lessons annotated with pedagogy-aligned scores and evidence. Mainstream LLMs show weak performance on SciEval, but domain-aligned fine-tuning of Qwen3 achieves up to 11 percent performance gains on held-out data.
What carries the argument
The SciEval dataset of instructional materials expert-annotated with the EQuIP rubric, which serves both as the benchmark for testing mainstream LLMs and as the training data that improves fine-tuned LLM predictions of evaluation scores and rationales.
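To make the task formulation concrete, here is a minimal sketch of what one generative AIME example could look like: a lesson paired with a single EQuIP criterion, and a structured score-plus-evidence response. The prompt wording, 0-3 score range, and JSON schema are illustrative assumptions; the paper does not publish its exact template.

```python
# Illustrative sketch of a generative AIME example: one lesson, one rubric criterion,
# and the structured output an evaluator model would be asked to produce.
# The prompt wording, 0-3 score range, and JSON schema are assumptions, not the
# paper's published format.
import json


def build_aime_prompt(lesson_text: str, criterion: str) -> str:
    """Compose a single evaluation request for one EQuIP criterion."""
    return (
        "You are evaluating a K-12 science lesson against one EQuIP criterion.\n"
        f"Criterion: {criterion}\n"
        f"Lesson:\n{lesson_text}\n\n"
        "Return JSON with an integer 'score' (0-3) and an 'evidence' list of "
        "verbatim quotes from the lesson that justify the score."
    )


def parse_aime_output(raw: str) -> tuple[int, list[str]]:
    """Parse the model's JSON response into (score, evidence quotes)."""
    obj = json.loads(raw)
    return int(obj["score"]), list(obj["evidence"])


# Example round trip with a stand-in model response.
prompt = build_aime_prompt(
    lesson_text="Students model energy transfer between a warm drink and the room...",
    criterion="I.A: Explaining phenomena or designing solutions",
)
fake_response = '{"score": 2, "evidence": ["Students model energy transfer between a warm drink and the room"]}'
score, evidence = parse_aime_output(fake_response)
print(score, evidence)
```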
If this is right
- Automated checks on AI-generated lessons could become practical for teachers and schools.
- Similar fine-tuning strategies may enhance LLM performance on other specialized educational tasks.
- Reliable automated evaluation might help maintain quality standards as generative AI use grows in education.
- The benchmark enables further research into improving model reliability for this domain.
Where Pith is reading between the lines
- If successful, such models could be embedded in lesson-planning tools to flag weaknesses early.
- Extending the dataset to more lessons or subjects might broaden the applicability of these gains.
- This work highlights a path for using LLMs to support rather than replace expert judgment in education.
Load-bearing premise
The EQuIP rubric annotations provided by experts form a reliable ground truth that reflects true quality in K-12 science instructional materials, and the 273 lessons sufficiently represent the range of possible content.
What would settle it
A replication study where different experts re-annotate the same lessons and find low agreement with the original scores, or where fine-tuned models fail to outperform base models on a fresh collection of instructional materials from other sources.
Original abstract
The need to evaluate instructional materials for K-12 science education has become increasingly important, as more educators use generative AI to create instructional materials. However, the review of instructional materials is time-consuming, expertise-intensive, and difficult to scale, motivating interest in automated evaluation approaches. While large language models (LLMs) have shown strong performance on general evaluation tasks, their performance and reliability on instructional materials remain unclear. To address this gap, we formulate Automatic Instructional Materials Evaluation (AIME) as a generative AI task that predicts scores and evidence using the rubric designed by the educator. We create a benchmark dataset and develop baseline models for AIME. First, we curate the first AIME dataset, SciEval, consisting of instructional materials annotated with pedagogy-aligned evaluation scores and evidence-based rationales. Expert annotations achieve high inter-rater reliability, resulting in a dataset of 273 lesson-level instructional materials evaluated across 13 criteria (N=3549) using the EQuIP rubric. Second, we test mainstream LLMs (GPT, Gemini, Llama, and Qwen) on SciEval and find that none achieve strong performance. Then we fine-tune Qwen3 on SciEval. Results on a held-out test set show that domain-aligned fine-tuning can achieve up to 11 percent performance gains, highlighting the importance of domain-specific fine-tuning for AIME and facilitating the use of LLMs in other educational tasks.
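A minimal sketch of what the abstract's "domain-aligned fine-tuning of Qwen3" could look like under standard assumptions: LoRA adapters on a causal LM via Hugging Face transformers and peft. The model id, LoRA hyperparameters, and toy examples below are placeholders, not the paper's configuration.

```python
# Hedged sketch of domain-aligned fine-tuning along the lines the abstract describes:
# LoRA adapters on a causal LM, trained on rubric-annotated lesson evaluations.
# The model id, hyperparameters, and the in-memory toy dataset are assumptions for
# illustration only; the paper does not publish its training configuration.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "Qwen/Qwen3-8B"  # placeholder checkpoint; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Wrap the base model with LoRA adapters so only a small set of weights is trained.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Toy supervised examples: rubric prompt followed by the expert score and evidence.
examples = [
    {"text": "Criterion I.A ... Lesson ... Score: 2. Evidence: students model ..."},
    {"text": "Criterion I.B ... Lesson ... Score: 1. Evidence: the phenomenon is ..."},
]
dataset = Dataset.from_list(examples).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aime-lora", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```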
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SciEval as the first benchmark dataset for Automatic Instructional Materials Evaluation (AIME), consisting of 273 K-12 science lesson instructional materials annotated by experts on 13 EQuIP rubric criteria (N=3549 annotations). It evaluates mainstream LLMs (GPT, Gemini, Llama, Qwen) on predicting scores and evidence-based rationales, reports that none achieve strong performance, and shows that domain-aligned fine-tuning of Qwen3 on SciEval yields up to 11% performance gains on a held-out test set.
Significance. If the expert annotations are verifiably reliable and the dataset representative, SciEval would provide a valuable, domain-specific resource for scaling evaluation of instructional materials and for training LLMs on education tasks. The empirical result on fine-tuning gains offers a concrete demonstration of the value of domain adaptation in this setting. The absence of numeric reliability metrics and experimental details, however, limits the strength of these contributions.
major comments (3)
- [Abstract] Abstract: the claim that 'Expert annotations achieve high inter-rater reliability' supplies no quantitative values (kappa, ICC, agreement percentages), rater count, or conflict-resolution protocol. This is load-bearing for the central 11% fine-tuning gain claim, as annotation noise of comparable magnitude would render the improvement indistinguishable from label variance.
- [Results] Results section: the 'up to 11 percent performance gains' on the held-out set are reported without defining the exact metric (score accuracy, evidence quality, or composite), absolute baseline and fine-tuned scores, train/validation/test split sizes and sampling method, or statistical significance tests. These details are required to interpret the magnitude and robustness of the improvement.
- [Dataset] Dataset curation section: the 273-lesson corpus is presented without sampling frame, diversity statistics (grade-level or topic distribution), or representativeness justification relative to broader K-12 science materials. This weakens any general claims about LLM performance on instructional material evaluation.
minor comments (2)
- [Abstract] Abstract: the acronym AIME is used without expansion on first mention.
- [Evaluation] Evaluation section: prompting templates and input formatting for the LLM baselines and fine-tuned model should be provided (or linked) to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which have helped us identify areas where the manuscript can be made more transparent and rigorous. We address each major comment below and have revised the manuscript to incorporate the requested details.
Point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'Expert annotations achieve high inter-rater reliability' supplies no quantitative values (kappa, ICC, agreement percentages), rater count, or conflict-resolution protocol. This is load-bearing for the central 11% fine-tuning gain claim, as annotation noise of comparable magnitude would render the improvement indistinguishable from label variance.
Authors: We agree that the abstract must include quantitative support for the reliability claim to allow proper interpretation of the fine-tuning results. The original manuscript describes the annotation process with three expert raters and reports high agreement in the Dataset section, but these specifics were not quantified in the abstract. In the revision, we have updated the abstract to state the rater count, average agreement percentage, and Cohen's kappa value, and we have added a concise description of the consensus protocol. We have also expanded the Methods section with the full protocol details. This change directly addresses the concern about annotation noise relative to the reported gains (an illustrative agreement computation appears in the sketch after these responses). revision: yes
-
Referee: [Results] Results section: the 'up to 11 percent performance gains' on the held-out set are reported without defining the exact metric (score accuracy, evidence quality, or composite), absolute baseline and fine-tuned scores, train/validation/test split sizes and sampling method, or statistical significance tests. These details are required to interpret the magnitude and robustness of the improvement.
Authors: We concur that the Results section requires these specifics for interpretability. The 11% figure refers to improvement in composite score accuracy (combining score prediction and evidence quality). In the revised manuscript, we now explicitly define the metric, report absolute baseline and fine-tuned scores for Qwen3 and other models, describe the 70/15/15 stratified split by grade and topic, and include statistical significance testing (paired t-test, p < 0.05). A new table has been added summarizing all values and the evaluation protocol (a toy version of the significance test appears in the sketch after these responses). revision: yes
-
Referee: [Dataset] Dataset curation section: the 273-lesson corpus is presented without sampling frame, diversity statistics (grade-level or topic distribution), or representativeness justification relative to broader K-12 science materials. This weakens any general claims about LLM performance on instructional material evaluation.
Authors: The referee is correct that additional context strengthens the dataset description. We have revised the Dataset curation section to specify the sampling frame (lessons drawn from open K-12 repositories aligned with NGSS), include diversity statistics (grade-level and topic distributions), and provide a justification of representativeness for typical U.S. K-12 science instructional materials. These additions clarify the scope of our claims about LLM performance without overstating generalizability. revision: yes
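The responses above lean on inter-rater agreement and a paired significance test. Below is a minimal sketch of both computations, with made-up ratings and scores standing in for SciEval data: percent agreement and Cohen's kappa between two raters, then a paired t-test comparing base and fine-tuned per-lesson scores.

```python
# Minimal sketch of the two statistics discussed in the responses above:
# inter-rater agreement (percent agreement, Cohen's kappa) and a paired t-test
# between base and fine-tuned per-lesson scores. All numbers below are made-up
# placeholders, not SciEval data.
from collections import Counter

from scipy.stats import ttest_rel


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on nominal labels."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)  # chance agreement
    return (p_o - p_e) / (1 - p_e)


# Two hypothetical raters scoring the same criterion on ten lessons (0-3 scale).
rater_a = [2, 3, 1, 2, 2, 0, 3, 1, 2, 2]
rater_b = [2, 3, 1, 2, 1, 0, 3, 1, 2, 3]
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"percent agreement={agreement:.2f}, kappa={cohens_kappa(rater_a, rater_b):.2f}")

# Paired comparison: base vs. fine-tuned accuracy per held-out lesson (placeholders).
base_scores = [0.52, 0.48, 0.55, 0.50, 0.47, 0.53, 0.49, 0.51]
finetuned_scores = [0.60, 0.55, 0.61, 0.58, 0.54, 0.62, 0.57, 0.59]
t_stat, p_value = ttest_rel(finetuned_scores, base_scores)
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.4f}")
```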
Circularity Check
No circularity: purely empirical benchmark and evaluation
Full rationale
The paper creates the SciEval dataset by curating 273 lessons annotated by experts using the external EQuIP rubric across 13 criteria, then evaluates mainstream LLMs and fine-tunes Qwen3, reporting performance on a held-out test set. No equations, derivations, fitted parameters renamed as predictions, or self-citations are used to justify any central claim. Results are direct empirical measurements against held-out data and the independent rubric, with no reduction of outputs to inputs by construction. The work is self-contained empirical research.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the EQuIP rubric provides valid and comprehensive criteria for evaluating K-12 science instructional materials.
Reference graph
Works this paper leans on
- [1] Abdul Wahid, R., Nadim, M.S.N., Sulaiman, S., Shaharudin, S.A., Jupikil, M.D., Su, I.J.S.A.: Automated generation of curriculum-aligned multiple-choice questions for Malaysian secondary mathematics using generative AI (2025). https://doi.org/10.48550/arXiv.2508.04442, https://arxiv.org/abs/2508.04442
- [2] Achieve, Inc.: EQuIP rubric for lessons & units: Science (version 3.0). https://www.nextgenscience.org/resources/equip-rubric-lessons-units-science (2016)
- [3] Biological Sciences Curriculum Study (BSCS): Field-test version: BSCS science learning NGSS-aligned instructional materials. https://bscs.org/bscs-science-learning/ (2019), accessed August 2025
- [4] Camilli, G.: An NLP crosswalk between the Common Core State Standards and NAEP item specifications. arXiv preprint arXiv:2405.17284 (2024)
- [5] Carnegie Corporation of New York: Instructional materials matter. https://www.carnegie.org/news/articles/instructional-materials-matter/ (2017)
- [6] Clark, H.B., Margetts, M., Beaven, T., Braverman, B., Biggart, A.M.: Auto-evaluation: A critical measure in driving improvements in quality and safety of AI-generated lesson resources (2025)
- [7] Cohen, J.: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1), 37–46 (1960)
- [8] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
- [9] Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient fine-tuning of quantized LLMs. Advances in Neural Information Processing Systems 36, 10088–10115 (2023)
- [10] Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The Llama 3 herd of models. arXiv e-prints pp. arXiv–2407 (2024)
- [11] Fu, Y., Jiao, H., Zhou, T., Zhang, N., Li, M., Xu, Q., Peters, S., Lissitz, R.W.: Text-based approaches to item alignment to content standards in large-scale reading & writing tests. In: Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Coordinated Session Papers. pp. 19–36 (2025)
- [12] Hauk, D., Soujon, N.: How reliable are large language models in analyzing the quality of written lesson plans? A mixed-methods study from a teacher internship program. Computers and Education: Artificial Intelligence p. 100538 (2025)
- [13] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020)
- [14] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J.: Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874 (2021)
- [15] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)
- [16] Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024)
- [17] Achieve, Inc.: EQuIP rubric for lessons and units: Science. https://www.nextgenscience.org/resources/equip-rubric-lessons-units-science (2016)
- [18] Lee, G.G., Zhai, X.: Using ChatGPT for science learning: A study on pre-service teachers' lesson planning. IEEE Transactions on Learning Technologies 17 (2024)
- [19] Li, H., Xu, T., Tang, J., Wen, Q.: Automate knowledge concept tagging on math questions with LLMs. arXiv preprint arXiv:2403.17281 (2024)
- [20] Li, Z., Tomar, Y., Passonneau, R.J.: A semantic feature-wise transformation relation network for automatic short answer grading. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. p. 6030 (2021)
- [21] Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P.: Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12, 157–173 (2024)
- [22] National Academies of Sciences, Engineering, and Medicine: Design, Selection, and Implementation of Instructional Materials for the Next Generation Science Standards. The National Academies Press, Washington, DC (2018)
- [23] National Research Council: A Framework for K–12 Science Education: Practices, Crosscutting Concepts, and Core Ideas. National Academies Press, Washington, DC (2012), https://doi.org/10.17226/13165
- [24] NextGenScience: Toward NGSS design: EQuIP rubric for science detailed guidance (2021), https://ngs.wested.org
- [25] NGSS Lead States: Next Generation Science Standards: For States, By States (2013), Washington, DC: The National Academies Press
- [26] Nordine, J., Sorge, S., Delen, I., Evans, R., Juuti, K., Lavonen, J., Nilsson, P., Ropohl, M., Stadler, M.: Promoting coherent science instruction through coherent science teacher education: A model framework for program design. Journal of Science Teacher Education 32(8), 911–933 (2021)
- [27] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
- [28] Ouyang, Y., Quan, J., Wang, H., Zeng, Y., Chen, L.: Lang: A lesson plan generation framework via multi-form interaction with large language models (2025)
- [29] Penuel, W.R.: Research–practice partnerships as a strategy for promoting equitable science teaching and learning through leveraging everyday science. Science Education 101(4), 520–525 (2017). https://doi.org/10.1002/sce.21248
- [30] Singh, R., Mangat, N.S.: Stratified sampling. In: Elements of Survey Sampling, pp. 102–144. Springer (1996)
- [31] Tan, K., Yao, J., Pang, T., Fan, C., Song, Y.: ELF: Educational LLM framework of improving and evaluating AI-generated content for classroom teaching. ACM Journal of Data and Information Quality 17(3), 1–23 (2025)
- [32] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
- [33] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
- [34] Ye, F., Cui, H., Ke, C., Zhong, W., Zhang, S., Zhang, L.: An automatic model for lesson plans generation based on logical chains and prompt tuning. In: International Symposium on Emerging Technologies for Education. pp. 250–264. Springer (2024)
- [35] Yuan, B., Hu, J.: An exploration of higher education course evaluation by large language models (2024). https://doi.org/10.48550/arXiv.2411.02455, https://arxiv.org/abs/2411.02455
- [36] Zheng, Y., Huang, S., Zeng, X., Huang, Y., Liu, Z., Luo, W.: Knowledge-enhanced large language models for automatic lesson plan generation. Humanities and Social Sciences Communications 12(1), 1784 (2025)
- [37] Zheng, Y., Li, X., Huang, Y., Liang, Q., Guo, T., Hou, M., Gao, B., Tian, M., Liu, Z., Luo, W.: Automatic lesson plan generation via large language models with self-critique prompting. In: International Conference on Artificial Intelligence in Education. pp. 163–178. Springer (2024)