DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild
Pith reviewed 2026-05-23 02:27 UTC · model grok-4.3
The pith
A dataset records student attention and emotion using multiple classroom cameras plus smartwatch sensors with both self and expert labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce the DIPSER dataset as the most comprehensive resource currently available for student attention and emotion recognition because it alone combines facial and environmental RGB camera streams, smartwatch sensor metrics, and attention-emotion labels produced by self-report plus four experts, all captured in in-the-wild in-person classroom sessions that include underrepresented ethnicities.
What carries the argument
The DIPSER dataset, which synchronizes multi-camera RGB video of face and posture with smartwatch physiological readings and supplies dual-source attention and emotion labels for each student.
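Synchronizing video frames with smartwatch readings usually comes down to matching timestamps from streams sampled at different rates. The sketch below shows one common approach, nearest-timestamp matching with a maximum allowed gap. It is an illustration only: the review does not describe how DIPSER performs its synchronization, and the function name `align_nearest`, the `max_gap` parameter, and the sample values are all hypothetical.

```python
from bisect import bisect_left


def align_nearest(frame_times, sensor_samples, max_gap=0.5):
    """Pair each frame timestamp with the nearest sensor sample.

    frame_times: frame timestamps in seconds.
    sensor_samples: (timestamp, value) pairs sorted by timestamp.
    max_gap: hypothetical tolerance in seconds; frames with no sample
        within this distance are paired with None.
    """
    sensor_times = [t for t, _ in sensor_samples]
    aligned = []
    for ft in frame_times:
        # Index of the first sensor timestamp >= ft.
        i = bisect_left(sensor_times, ft)
        # The nearest sample is either just before or at/after ft.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(sensor_times)]
        value = None
        if candidates:
            best = min(candidates, key=lambda j: abs(sensor_times[j] - ft))
            if abs(sensor_times[best] - ft) <= max_gap:
                value = sensor_samples[best][1]
        aligned.append((ft, value))
    return aligned


# Made-up heart-rate samples at 1 Hz and four video frame timestamps.
heart_rate = [(0.0, 70), (1.0, 72), (2.0, 75)]
frames = [0.0, 0.4, 1.1, 5.0]
pairs = align_nearest(frames, heart_rate)
```

The last frame falls outside the sensor recording, so it is paired with `None` rather than with a stale reading.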
If this is right
- Systems can be trained to estimate attention from combined posture, facial expression, and sensor signals during live lessons.
- Researchers gain material to study how emotional states correspond with attention levels inside actual classrooms.
- Training examples now cover ethnic groups that appear infrequently in earlier engagement datasets.
- Work on engagement recognition can move from controlled laboratory conditions to everyday in-person teaching environments.
Where Pith is reading between the lines
- The same recordings could support tools that give teachers immediate signals about which students have lost focus.
- The dual labeling method might be checked for agreement when the same sessions are rated by larger or more diverse groups of experts.
- Patterns found in the data could be compared across age groups or subject areas to test whether engagement signatures stay stable.
- Adding longer continuous recordings might reveal how attention changes over the course of a full class period.
Load-bearing premise
Ratings made by students and four experts accurately reflect genuine attention and emotion without systematic bias from the camera setup or from the raters themselves.
What would settle it
Models trained on the dataset produce attention predictions that match neither new self-reports nor independent expert ratings when tested in different classrooms or with different teachers.
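A test of this kind requires that evaluation data come from classrooms the model never saw during training. The following is a minimal sketch of a leave-one-group-out split written under that assumption. The field name `classroom` and the sample records are hypothetical and do not reflect the actual DIPSER file layout.

```python
def leave_one_group_out(samples, group_key):
    """Yield (held_out_group, train, test) folds.

    Each fold holds out every sample belonging to one group, so the test
    set never shares a group (for example, a classroom) with training data.
    """
    groups = sorted({s[group_key] for s in samples})
    for held_out in groups:
        train = [s for s in samples if s[group_key] != held_out]
        test = [s for s in samples if s[group_key] == held_out]
        yield held_out, train, test


# Hypothetical records: one entry per labelled clip.
samples = [
    {"clip": 1, "classroom": "A", "attention": 4},
    {"clip": 2, "classroom": "A", "attention": 2},
    {"clip": 3, "classroom": "B", "attention": 5},
    {"clip": 4, "classroom": "C", "attention": 3},
]
folds = list(leave_one_group_out(samples, "classroom"))
```

The same function can split by teacher instead of classroom by passing a different `group_key`.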
Original abstract
In this paper, a novel dataset is introduced, designed to assess student attention within in-person classroom settings. This dataset encompasses RGB camera data, featuring multiple cameras per student to capture both posture and facial expressions, in addition to smartwatch sensor data for each individual. This dataset allows machine learning algorithms to be trained to predict attention and correlate it with emotion. A comprehensive suite of attention and emotion labels for each student is provided, generated through self-reporting as well as evaluations by four different experts. Our dataset uniquely combines facial and environmental camera data, smartwatch metrics, and includes underrepresented ethnicities in similar datasets, all within in-the-wild, in-person settings, making it the most comprehensive dataset of its kind currently available. The dataset presented offers an extensive and diverse collection of data pertaining to student interactions across different educational contexts, augmented with additional metadata from other tools. This initiative addresses existing deficiencies by offering a valuable resource for the analysis of student attention and emotion in face-to-face lessons.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the DIPSER dataset for recognizing student engagement and emotion in in-person classroom settings. It includes multi-camera RGB video (facial and environmental views per student), smartwatch sensor data, and labels for attention/emotion generated via self-reports combined with evaluations from four experts. The paper claims this combination of modalities, plus inclusion of underrepresented ethnicities in an in-the-wild setting, makes DIPSER the most comprehensive dataset of its kind.
Significance. If the data release is complete, the labels are shown to be reliable, and quantitative comparisons confirm the claimed advantages in scale and diversity, the dataset could provide a useful multimodal resource for training engagement-recognition models that better reflect real classroom conditions and demographic variation.
major comments (2)
- [Abstract] Abstract: the claim that the dataset is 'the most comprehensive' is not supported by any quantitative comparison table or metrics (e.g., number of subjects, total hours, ethnic distribution statistics, or label-agreement scores) against prior datasets; without such evidence the uniqueness assertion cannot be evaluated.
- [Abstract] Abstract (labeling description): no inter-rater agreement statistics (e.g., Fleiss' kappa or ICC) are reported for the four-expert plus self-report labels, nor is there validation against physiological correlates or discussion of possible biases (camera reactivity, cultural expression differences); this directly affects whether the labels constitute usable ground truth for ML.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the abstract claims require stronger quantitative support and that label reliability must be demonstrated more explicitly. We will revise the manuscript to address both points.
- Referee: [Abstract] Abstract: the claim that the dataset is 'the most comprehensive' is not supported by any quantitative comparison table or metrics (e.g., number of subjects, total hours, ethnic distribution statistics, or label-agreement scores) against prior datasets; without such evidence the uniqueness assertion cannot be evaluated.
  Authors: We agree that the claim requires quantitative backing. In the revised manuscript we will insert a comparison table against prior engagement datasets, reporting number of subjects, total hours, ethnic distribution statistics, and label-agreement scores where available in the literature. Revision: yes.
- Referee: [Abstract] Abstract (labeling description): no inter-rater agreement statistics (e.g., Fleiss' kappa or ICC) are reported for the four-expert plus self-report labels, nor is there validation against physiological correlates or discussion of possible biases (camera reactivity, cultural expression differences); this directly affects whether the labels constitute usable ground truth for ML.
  Authors: We will add inter-rater agreement statistics (Fleiss' kappa) computed across the four experts in the revised version. The smartwatch data contains physiological signals (heart rate, etc.) that could support limited correlation analysis; we will either include a short validation subsection or explicitly discuss its feasibility. We will also add a dedicated paragraph addressing potential biases including camera reactivity and cultural expression differences. Revision: yes.
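For readers unfamiliar with the statistic named in this exchange, Fleiss' kappa measures agreement among a fixed number of raters beyond what chance would produce. The sketch below implements the standard formula in plain Python. The label values and the ratings are invented for illustration and are not taken from DIPSER.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items each rated by the same number of raters.

    ratings: list of items; each item is a list of category labels,
        one label per rater.
    categories: all category labels that raters could assign.
    Returns NaN when chance agreement is 1 (kappa is undefined).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of raters")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    total = n_items * n_raters
    p_expected = sum(
        (sum(c[cat] for c in counts) / total) ** 2 for cat in categories
    )

    if p_expected == 1.0:
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented example: four raters, two clips, perfect agreement on each clip.
perfect = [["low"] * 4, ["high"] * 4]
kappa_perfect = fleiss_kappa(perfect, ["low", "high"])
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.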
Circularity Check
No circularity: dataset release paper with no derivations or fitted results
The paper is a dataset introduction with no equations, models, predictions, or parameter fitting. Claims about uniqueness rest on factual descriptions of data modalities, participant demographics, and labeling process (self-report + four experts), none of which reduce to self-referential definitions or self-citations. No load-bearing steps exist that could be circular by the enumerated patterns. The label reliability concern raised by the skeptic is a validity issue, not a circularity issue.
Reference graph
Works this paper leans on
- [1] Gupta, A., D’Cunha, A., Awasthi, K. & Balasubramanian, V. Daisee: Towards user engagement recognition in the wild. arXiv preprint arXiv:1609.01885 (2016). doi:10.48550/ARXIV.1609.01885
- [2] Zhalehpour, S., Onder, O., Akhtar, Z. & Erdem, C. E. Baum-1: A spontaneous audio-visual face database of affective and mental states. IEEE Transactions on Affective Computing 8, 300–313 (2016)
- [3] Kaur, A., Mustafa, A., Mehta, L. & Dhall, A. Prediction and localization of student engagement in the wild. In 2018 Digital Image Computing: Techniques and Applications (DICTA), 1–8 (IEEE, 2018)
- [4] Pekrun, R., Frenzel, A. C., Goetz, T. & Perry, R. P. The control-value theory of achievement emotions: An integrative approach to emotions in education. In Emotion in education, 13–36 (Elsevier, 2007)
- [5] Pekrun, R., Goetz, T., Frenzel, A. C., Barchfeld, P. & Perry, R. P. Measuring emotions in students’ learning and performance: The achievement emotions questionnaire (aeq). Contemporary educational psychology 36, 36–48 (2011)
- [6] Goldberg, P. et al. Attentive or not? toward a machine learning approach to assessing students’ visible engagement in classroom instruction. Educational Psychology Review 33, 27–49 (2021)
- [7] Liu, T., Ungar, L. & Kording, K. Quantifying causality in data science with quasi-experiments. Nature computational science 1, 24–32 (2021)
- [8]
- [9] Chen, J., Chen, Y., Ou, R., Wang, J. & Chen, Q. How to use artificial intelligence to improve entrepreneurial attitude in business simulation games: implications from a quasi-experiment. Frontiers in Psychology 13, 856085 (2022)
- [10] Marquez-Carpintero, L. et al. Author spotlight: Addressing technical and subjective challenges in measuring classroom attention. JoVE (Journal of Visualized Experiments) e65931 (2023)
- [11] Cobo, A., Valle, R., Buenaposada, J. M. & Baumela, L. On the representation and methodology for wide and short range head pose estimation. Pattern Recognition 149, 110263 (2024)
- [12] Kuprashevich, M. & Tolstykh, I. Mivolo: Multi-input transformer for age and gender estimation. arXiv preprint arXiv:2307.04616 (2023)
- [13]
- [14] Grishchenko, I., Ablavatski, A., Kartynnik, Y., Raveendran, K. & Grundmann, M. Attention mesh: High-fidelity face mesh prediction in real-time. arXiv preprint arXiv:2006.10962 (2020)
- [15]
- [16] Serengil, S. I. & Ozpinar, A. Hyperextended lightface: A facial attribute analysis framework. In 2021 International Conference on Engineering and Emerging Technologies (ICEET), 1–4, doi:10.1109/ICEET53442.2021.9659697 (IEEE, 2021)
- [17] Belen, D. How cranial shapes led to contemporary ethnic classification: a historical view. Turkish Neurosurgery 28, 490–494 (2018)
- [18] Goodyear, M. D., Krleza-Jeric, K. & Lemmens, T. The declaration of helsinki (2007)
- [19] Singh, M. et al. Do i have your attention: A large scale engagement prediction dataset and baselines. In Proceedings of the 25th International Conference on Multimodal Interaction, 174–182 (2023)
- [20] Wang, S. et al. A natural visible and infrared facial expression database for expression recognition and emotion inference. IEEE Transactions on Multimedia 12, 682–691 (2010)
- [21] Delgado, K. et al. Student engagement dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 3628–3636 (2021)
- [22] Whitehill, J., Serpell, Z., Lin, Y.-C., Foster, A. & Movellan, J. R. The faces of engagement: Automatic recognition of student engagement from facial expressions. IEEE Transactions on Affective Computing 5, 86–98 (2014)
Acknowledgements
This project has been developed under the framework of the CIPROM/2021/17 Prometeo project entitled "Meebai: Una metodolog...