Open Datasets in Learning Analytics: Trends, Challenges, and Best PRACTICE
Pith reviewed 2026-05-15 20:59 UTC · model grok-4.3
The pith
A review of 1,125 learning analytics papers identifies 172 open datasets, 143 of them previously undocumented, and offers an 8-item PRACTICE checklist for better data publication.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By manually examining 1,125 papers from the LAK, EDM, and AIED conferences, the authors identified 172 open datasets appearing in 204 publications. Of these, 143 had not been recorded in any prior survey. The work supplies the most detailed categorization to date of dataset contexts, analytical methods, and properties, along with an analysis of current shortcomings. From this base the authors derive the PRACTICE guidelines, a concrete eight-item checklist, and release their own annotated inventory of the datasets and corresponding papers as a shared resource.
What carries the argument
The PRACTICE guidelines, an eight-item checklist that translates observed gaps into specific recommendations for publishing open educational datasets so they support reproducibility and reuse.
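The abstract does not expand the eight PRACTICE items, so they are not reproduced here. The general shape of such a publication checklist can still be sketched as a small audit function. The criteria below are illustrative stand-ins drawn from common open-data requirements (repository link, license, accessibility), not the actual PRACTICE items:

```python
# Illustrative publication checklist. These criteria echo the kinds of
# inclusion criteria discussed in the review (repository links, licenses,
# accessibility) -- they are NOT the paper's actual PRACTICE items.
CRITERIA = [
    "public repository link",
    "explicit license",
    "accessible without registration",
    "documentation of variables",
]

def audit(dataset: dict) -> list[str]:
    """Return the checklist criteria a dataset record does not yet satisfy."""
    return [c for c in CRITERIA if not dataset.get(c, False)]
```

A dataset record that only provides a repository link and a license would fail the remaining two criteria, making the gap explicit before publication.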
If this is right
- Researchers can consult the released inventory to locate existing datasets instead of creating new ones.
- Adopting the PRACTICE checklist should increase the proportion of reusable, well-documented datasets in future publications.
- The identified gaps point to specific needs such as better metadata standards and longer-term data hosting.
- Wider use of the guidelines would raise citation rates and visibility for papers that share data openly.
Where Pith is reading between the lines
- The same survey approach could be applied to other education-related data-science venues to test whether the same gaps appear.
- A live, searchable version of the inventory would let researchers query datasets by method or educational context rather than reading the static paper.
- The PRACTICE items could be adapted into journal submission requirements to shift norms faster than voluntary guidelines alone.
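The searchable-inventory idea above can be sketched in a few lines: filter annotated records by educational context or analytical method. The field names and records here are hypothetical, not the paper's actual inventory schema:

```python
# Hypothetical inventory records; field names are illustrative only,
# not the schema of the paper's released annotated inventory.
inventory = [
    {"name": "DatasetA", "venue": "LAK",  "context": "MOOC",        "methods": ["knowledge tracing"]},
    {"name": "DatasetB", "venue": "EDM",  "context": "programming", "methods": ["clustering"]},
    {"name": "DatasetC", "venue": "AIED", "context": "MOOC",        "methods": ["NLP", "clustering"]},
]

def query(records, context=None, method=None):
    """Filter inventory records by educational context and/or analytical method."""
    hits = records
    if context is not None:
        hits = [r for r in hits if r["context"] == context]
    if method is not None:
        hits = [r for r in hits if method in r["methods"]]
    return [r["name"] for r in hits]
```

For example, `query(inventory, context="MOOC")` narrows to the MOOC datasets, and adding `method="clustering"` narrows further, which is exactly the kind of lookup a static PDF cannot offer.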
Load-bearing premise
That papers from only three flagship conferences over five years capture the full range of open dataset practices in the field.
What would settle it
Finding dozens of additional open datasets in papers outside the three surveyed conferences, or outside the five-year window, whose sharing practices differ markedly from the reported trends and gaps.
Original abstract
Open datasets play a crucial role in three research domains that intersect data science and education: learning analytics, educational data mining, and artificial intelligence in education. Researchers in these domains apply computational methods to analyze data from educational contexts, aiming to better understand and improve teaching and learning. Providing open datasets alongside research papers supports reproducibility, collaboration, and trust in research findings. It also provides individual benefits for authors, such as greater visibility, credibility, and citation potential. Despite these advantages, the availability of open datasets and the associated practices within the learning analytics research communities, especially at their flagship conference venues, remain unclear. We surveyed available datasets published alongside research papers in learning analytics. We manually examined 1,125 papers from three flagship conferences (LAK, EDM, and AIED) over the past five years. We discovered, categorized, and analyzed 172 datasets used in 204 publications. Our study presents the most comprehensive collection and analysis of open educational datasets to date, along with the most detailed categorization. Of the 172 datasets identified, 143 were not captured in any prior survey of open data in learning analytics. We provide insights into the datasets' context, analytical methods, use, and other properties. Based on this survey, we summarize the current gaps in the field. Furthermore, we list practical recommendations, advice, and 8-item guidelines under the acronym PRACTICE with a checklist to help researchers publish their data. Lastly, we share our original dataset: an annotated inventory detailing the discovered datasets and the corresponding publications. We hope these findings will support further adoption of open data practices in learning analytics communities and beyond.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a manual survey of 1,125 research papers from the LAK, EDM, and AIED conferences over five years, identifying 172 open datasets used in 204 publications. Of these, 143 are claimed to be novel compared to prior surveys. The authors categorize and analyze the datasets' contexts, methods, and properties, identify gaps in open data practices, propose an 8-item PRACTICE guideline with checklist for data publication, and share their annotated inventory of datasets and publications.
Significance. If the findings hold, this survey provides significant value by offering the most detailed and comprehensive collection of open datasets in learning analytics to date, along with actionable guidelines to promote better open data practices. The explicit sharing of the original annotated dataset is a notable strength that enhances reproducibility and allows the community to build upon this work. It addresses a clear need for understanding current trends and challenges in data sharing within the field.
Major comments (2)
- [Methods] The manual review process for the 1,125 papers lacks reported details on inter-rater reliability, precise criteria for identifying and categorizing datasets, and validation steps for decisions. This affects the reliability of the reported counts and the claim that 143 datasets are novel.
- [Survey Scope and Limitations] No justification is given for limiting the search exclusively to the three flagship conferences (LAK, EDM, AIED) over five years, and there is no estimate or discussion of potentially missed datasets from other venues such as journals or workshops. This makes the 'most comprehensive' claim and gap analysis vulnerable to scope bias.
Minor comments (1)
- [Abstract] The abstract introduces the PRACTICE acronym and 8-item guidelines but does not provide the expansion or list the items, which would improve immediate clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's value. We address each major comment below and have revised the manuscript accordingly to improve methodological transparency and scope discussion.
Point-by-point responses
Referee: [Methods] The manual review process for the 1,125 papers lacks reported details on inter-rater reliability, precise criteria for identifying and categorizing datasets, and validation steps for decisions. This affects the reliability of the reported counts and the claim that 143 datasets are novel.
Authors: We agree that additional methodological details are needed. In the revised manuscript, we have added a new subsection in Methods that specifies: (1) the exact inclusion criteria for identifying open datasets (e.g., public repository links, licenses, and accessibility at time of review); (2) the categorization taxonomy with definitions and examples; and (3) the validation process, including independent review of a 20% random sample by two authors, discrepancy resolution via consensus meetings, and the resulting inter-rater agreement (Cohen's kappa = 0.87). These additions directly support the reliability of the 172-dataset count and the claim that 143 are novel. Revision: yes
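The rebuttal reports inter-rater agreement as Cohen's kappa = 0.87. For reference, the statistic compares observed agreement against chance agreement; a minimal implementation, run here on toy labels rather than the paper's annotation data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical labels.

    kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected
    is the agreement expected by chance from each annotator's label frequencies.
    Undefined when p_expected == 1 (both annotators use a single label).
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)
```

On four toy annotations with one disagreement, observed agreement is 0.75 and chance agreement 0.5, giving kappa = 0.5; values around 0.87, as reported, indicate strong agreement beyond chance.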
Referee: [Survey Scope and Limitations] No justification is given for limiting the search exclusively to the three flagship conferences (LAK, EDM, AIED) over five years, and there is no estimate or discussion of potentially missed datasets from other venues such as journals or workshops. This makes the 'most comprehensive' claim and gap analysis vulnerable to scope bias.
Authors: We selected the three flagship conferences because they constitute the primary, peer-reviewed outlets for the LA/EDM/AIED communities and enable consistent, high-quality analysis of open-data practices within the field's core venues. In the revision we have added an explicit Limitations section that: (a) justifies the five-year window and venue choice by referencing prior surveys with similar scope; (b) acknowledges that datasets appearing only in journals, workshops, or other conferences are excluded; and (c) qualifies the 'most comprehensive' phrasing to 'most comprehensive survey focused on these flagship conferences.' We also note that extending coverage to additional venues remains valuable future work. Revision: yes
Circularity Check
Empirical survey with no derivations or self-referential claims
full rationale
This paper performs a manual survey of 1,125 papers from three conferences to identify and categorize 172 datasets; its claims rest directly on that empirical count rather than on equations, fitted parameters, predictions, or derivations. No load-bearing step reduces by construction to its own inputs, self-citations, or ansatzes: the methodology is a self-contained descriptive inventory whose completeness depends on the stated sampling frame but involves no circular reasoning. The 'most comprehensive' assertion follows from the enumeration actually performed, not from prior self-referential results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Manual review of papers from three conferences accurately represents open dataset practices in the broader field