SCALEFeedback: A Large-Scale Dataset of Synthetic Computer Science Assignments for LLM-generated Educational Feedback Research

Dragan Gašević; Flora Jin; Guanliang Chen; Kaixun Yang; Keyang Qian; Lixiang Yan; Rui Guan; Sadia Nawaz; Wei Dai; Yixin Cheng


Using Large Language Models (LLMs) to give students educational feedback on their assignments has attracted much attention in the AI in Education (AIED) field. Yet there is currently no large-scale open-source dataset of student assignments that includes detailed assignment descriptions, rubrics, and student submissions across various courses. As a result, research on generalisable methods for automatically generating effective and responsible educational feedback remains limited. In this paper, we introduce SCALEFeedback (Synthetic Computer science Assignments for LLM Educational Feedback Research), a synthetic dataset of university computer science assignments for LLM-based educational feedback research. The dataset is generated with the Sophisticated Assignment Mimicry (SAM) framework, which was designed specifically to synthesise this dataset and uses one-to-one LLM-based imitation of real assignment descriptions, rubrics, and student submissions. Our open-source dataset contains 10,000 synthetic student submissions spanning 155 assignments across 59 university-level computer science courses. Technical validation confirmed that the synthetic dataset closely resembles the real data while successfully eliminating the personally identifiable information present in the source material. This dataset is a valuable contribution for researchers who aim to develop generalisable LLM-based methods for offering high-quality, automated educational feedback at scale.

