
arxiv: 2508.05953 · v2 · pith:3LTBNC3Anew · submitted 2025-08-08 · 💻 cs.CY

SCALEFeedback: A Large-Scale Dataset of Synthetic Computer Science Assignments for LLM-generated Educational Feedback Research

classification 💻 cs.CY
keywords dataset, educational, feedback, assignments, synthetic, assignment, computer, research

Using Large Language Models (LLMs) to give students educational feedback on their assignments has attracted much attention in the AI in Education (AIED) field. Yet there is currently no large-scale open-source dataset of student assignments that includes detailed assignment descriptions, rubrics, and student submissions across various courses. As a result, research on generalisable methods for automatically generating effective and responsible educational feedback remains limited. In this paper, we introduce SCALEFeedback (Synthetic Computer science Assignments for LLM Educational Feedback Research), a synthetic dataset of university computer science assignments for LLM-based educational feedback research. The dataset is generated via the Sophisticated Assignment Mimicry (SAM) framework, which was designed specifically to synthesise this dataset and uses one-to-one LLM-based imitation of real assignment descriptions, rubrics, and student submissions. Our open-source dataset contains 10,000 synthetic student submissions spanning 155 assignments across 59 university-level computer science courses. Technical validation confirmed that the synthetic dataset closely resembles the real data while eliminating the personally identifiable information present in the source material. This dataset is a valuable contribution for researchers who aim to develop generalisable LLM-based methods for offering high-quality automated educational feedback at scale.

