DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild

Carolina Lorenzo Álvarez; Diego Viejo; Jorge Fernandez-Herrero; Luis Marquez-Carpintero; Miguel Cazorla; Rosabel Roig-Vila; Sergio Suescun-Ferrandiz

arxiv: 2502.20209 · v3 · submitted 2025-02-27 · 💻 cs.CV · cs.AI

Luis Marquez-Carpintero, Sergio Suescun-Ferrandiz, Carolina Lorenzo Álvarez, Jorge Fernandez-Herrero, Diego Viejo, Rosabel Roig-Vila, Miguel Cazorla

Pith reviewed 2026-05-23 02:27 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords student engagement · attention recognition · emotion detection · classroom dataset · multi-camera recording · smartwatch sensors · in-the-wild data · engagement labels

The pith

A dataset records student attention and emotion using multiple classroom cameras plus smartwatch sensors with both self and expert labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a new collection of recordings meant to let machine learning systems learn to recognize how attentive students are during ordinary in-person lessons. It gathers video from several angles per student to see both faces and posture, adds readings from wearable sensors on each wrist, and supplies attention and emotion ratings made by the students themselves together with four separate experts. The collection is offered as more complete than earlier ones because it works in real classrooms rather than labs and includes students from ethnic backgrounds that are often missing from similar data. If the labels hold up, algorithms trained on these recordings could link visible behavior and physiological signals to attention levels and to emotional states at the same time.

Core claim

The authors introduce the DIPSER dataset as the most comprehensive resource currently available for student attention and emotion recognition because it alone combines facial and environmental RGB camera streams, smartwatch sensor metrics, and attention-emotion labels produced by self-report plus four experts, all captured in in-the-wild in-person classroom sessions that include underrepresented ethnicities.

What carries the argument

The DIPSER dataset, which synchronizes multi-camera RGB video of face and posture with smartwatch physiological readings and supplies dual-source attention and emotion labels for each student.
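One way to picture the synchronization this description implies is nearest-timestamp alignment between video frames and smartwatch samples. The sketch below is only an illustration: the summary above does not say how DIPSER aligns its streams, and the function name `align_nearest`, the `max_gap` tolerance, and the sample values are all hypothetical.

```python
from bisect import bisect_left


def align_nearest(frame_times, sensor_times, sensor_values, max_gap=0.5):
    """Pair each frame timestamp with the nearest sensor reading.

    Assumes sensor_times is sorted ascending and all times are in seconds.
    Returns None for frames with no reading within max_gap seconds.
    (Hypothetical helper; not the dataset's documented procedure.)
    """
    aligned = []
    for t in frame_times:
        i = bisect_left(sensor_times, t)
        # Only the sample just before and the sample just after t can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(sensor_times)]
        if not candidates:
            aligned.append(None)
            continue
        best = min(candidates, key=lambda j: abs(sensor_times[j] - t))
        within_gap = abs(sensor_times[best] - t) <= max_gap
        aligned.append(sensor_values[best] if within_gap else None)
    return aligned


# Invented example: ~25 fps video frames against 1 Hz heart-rate samples.
frame_times = [0.0, 0.04, 1.0, 5.0]
sensor_times = [0.0, 1.0, 2.0]
heart_rate = [72, 75, 74]
aligned = align_nearest(frame_times, sensor_times, heart_rate)
```

The `max_gap` guard matters in practice because wearables can drop samples; returning `None` keeps a stale reading from being attached to a frame several seconds away.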

If this is right

Systems can be trained to estimate attention from combined posture, facial expression, and sensor signals during live lessons.
Researchers gain material to study how emotional states correspond with attention levels inside actual classrooms.
Training examples now cover ethnic groups that appear infrequently in earlier engagement datasets.
Work on engagement recognition can move from controlled laboratory conditions to everyday in-person teaching environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same recordings could support tools that give teachers immediate signals about which students have lost focus.
The dual labeling method might be checked for agreement when the same sessions are rated by larger or more diverse groups of experts.
Patterns found in the data could be compared across age groups or subject areas to test whether engagement signatures stay stable.
Adding longer continuous recordings might reveal how attention changes over the course of a full class period.

Load-bearing premise

Ratings made by students and four experts accurately reflect genuine attention and emotion without systematic bias from the camera setup or from the raters themselves.

What would settle it

Models trained on the dataset produce attention predictions that match neither new self-reports nor independent expert ratings when tested in different classrooms or with different teachers.

Figures

Figures reproduced from arXiv: 2502.20209 by Carolina Lorenzo Álvarez, Diego Viejo, Jorge Fernandez-Herrero, Luis Marquez-Carpintero, Miguel Cazorla, Rosabel Roig-Vila, Sergio Suescun-Ferrandiz.

**Figure 2.** General image of a context camera for each learning environment.

**Figure 3.** Capture setup for experiments 1 to 5.

**Figure 4.** Capture setup for experiments 6 to 9.

**Figure 5.** Changes in heart rate during the experiments.

**Figure 6.** Number of labels per experiment.

**Figure 7.** Dimensional reduction (t-SNE) of watch sensors.

**Figure 8.** Descriptive statistics of students per group.

Original abstract

In this paper, a novel dataset is introduced, designed to assess student attention within in-person classroom settings. This dataset encompasses RGB camera data, featuring multiple cameras per student to capture both posture and facial expressions, in addition to smartwatch sensor data for each individual. This dataset allows machine learning algorithms to be trained to predict attention and correlate it with emotion. A comprehensive suite of attention and emotion labels for each student is provided, generated through self-reporting as well as evaluations by four different experts. Our dataset uniquely combines facial and environmental camera data, smartwatch metrics, and includes underrepresented ethnicities in similar datasets, all within in-the-wild, in-person settings, making it the most comprehensive dataset of its kind currently available. The dataset presented offers an extensive and diverse collection of data pertaining to student interactions across different educational contexts, augmented with additional metadata from other tools. This initiative addresses existing deficiencies by offering a valuable resource for the analysis of student attention and emotion in face-to-face lessons.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces the DIPSER dataset for recognizing student engagement and emotion in in-person classroom settings. It includes multi-camera RGB video (facial and environmental views per student), smartwatch sensor data, and labels for attention/emotion generated via self-reports combined with evaluations from four experts. The paper claims this combination of modalities, plus inclusion of underrepresented ethnicities in an in-the-wild setting, makes DIPSER the most comprehensive dataset of its kind.

Significance. If the data release is complete, the labels are shown to be reliable, and quantitative comparisons confirm the claimed advantages in scale and diversity, the dataset could provide a useful multimodal resource for training engagement-recognition models that better reflect real classroom conditions and demographic variation.

major comments (2)

[Abstract] Abstract: the claim that the dataset is 'the most comprehensive' is not supported by any quantitative comparison table or metrics (e.g., number of subjects, total hours, ethnic distribution statistics, or label-agreement scores) against prior datasets; without such evidence the uniqueness assertion cannot be evaluated.
[Abstract] Abstract (labeling description): no inter-rater agreement statistics (e.g., Fleiss' kappa or ICC) are reported for the four-expert plus self-report labels, nor is there validation against physiological correlates or discussion of possible biases (camera reactivity, cultural expression differences); this directly affects whether the labels constitute usable ground truth for ML.
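For readers unfamiliar with the statistic the referee asks for, the following is a minimal sketch of Fleiss' kappa for a fixed number of raters per item. The rating counts are invented for illustration and are not DIPSER data; the three-level attention scale is likewise an assumption.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who put item i in category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(counts[0])

    # Share of all ratings that fall in each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    if p_e == 1.0:
        # All ratings in one category: kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Invented example: 5 video segments, 4 expert raters,
# columns = (low, medium, high) attention.
example_counts = [
    [4, 0, 0],
    [3, 1, 0],
    [0, 2, 2],
    [0, 0, 4],
    [1, 3, 0],
]
kappa = fleiss_kappa(example_counts)
```

Values near 1 indicate agreement well above chance, values near 0 indicate chance-level agreement, and negative values indicate systematic disagreement. Fleiss' kappa treats categories as unordered, so an ordinal attention scale might be better served by a weighted statistic or an ICC, as the referee also suggests.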

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract claims require stronger quantitative support and that label reliability must be demonstrated more explicitly. We will revise the manuscript to address both points.

Point-by-point responses

Referee: [Abstract] Abstract: the claim that the dataset is 'the most comprehensive' is not supported by any quantitative comparison table or metrics (e.g., number of subjects, total hours, ethnic distribution statistics, or label-agreement scores) against prior datasets; without such evidence the uniqueness assertion cannot be evaluated.

Authors: We agree that the claim requires quantitative backing. In the revised manuscript we will insert a comparison table against prior engagement datasets, reporting number of subjects, total hours, ethnic distribution statistics, and label-agreement scores where available in the literature. revision: yes
Referee: [Abstract] Abstract (labeling description): no inter-rater agreement statistics (e.g., Fleiss' kappa or ICC) are reported for the four-expert plus self-report labels, nor is there validation against physiological correlates or discussion of possible biases (camera reactivity, cultural expression differences); this directly affects whether the labels constitute usable ground truth for ML.

Authors: We will add inter-rater agreement statistics (Fleiss' kappa) computed across the four experts in the revised version. The smartwatch data contains physiological signals (heart rate, etc.) that could support limited correlation analysis; we will either include a short validation subsection or explicitly discuss its feasibility. We will also add a dedicated paragraph addressing potential biases including camera reactivity and cultural expression differences. revision: yes
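The correlation analysis the authors mention could take several forms. As one hedged illustration, the sketch below computes a Spearman rank correlation between per-segment mean heart rate and an ordinal attention label. The segmenting, the label scale, and every number are assumptions made for the example; nothing here reflects results from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: mean heart rate (bpm) per segment and an
# attention label on a hypothetical 1 (low) to 3 (high) scale.
heart_rate = [68.0, 72.5, 80.1, 75.3, 90.2]
attention = [3, 3, 2, 2, 1]
rho = spearman(heart_rate, attention)
```

A rank-based measure is used here because attention labels are ordinal and heart rate is unlikely to relate to them linearly. Any real analysis would also need to account for repeated measurements from the same student.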

Circularity Check

0 steps flagged

No circularity: dataset release paper with no derivations or fitted results

full rationale

The paper is a dataset introduction with no equations, models, predictions, or parameter fitting. Claims about uniqueness rest on factual descriptions of data modalities, participant demographics, and labeling process (self-report + four experts), none of which reduce to self-referential definitions or self-citations. No load-bearing steps exist that could be circular by the enumerated patterns. The label reliability concern raised by the skeptic is a validity issue, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation exists; the contribution is the dataset itself. No free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5735 in / 1132 out tokens · 24088 ms · 2026-05-23T02:27:34.636203+00:00 · methodology

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1] Gupta, A., D’Cunha, A., Awasthi, K. & Balasubramanian, V. Daisee: Towards user engagement recognition in the wild. arXiv preprint arXiv:1609.01885 (2016). doi:10.48550/ARXIV.1609.01885

[2] Zhalehpour, S., Onder, O., Akhtar, Z. & Erdem, C. E. Baum-1: A spontaneous audio-visual face database of affective and mental states. IEEE Transactions on Affective Computing 8, 300–313 (2016)

[3] Kaur, A., Mustafa, A., Mehta, L. & Dhall, A. Prediction and localization of student engagement in the wild. In 2018 Digital Image Computing: Techniques and Applications (DICTA), 1–8 (IEEE, 2018)

[4] Pekrun, R., Frenzel, A. C., Goetz, T. & Perry, R. P. The control-value theory of achievement emotions: An integrative approach to emotions in education. In Emotion in education, 13–36 (Elsevier, 2007)

[5] Pekrun, R., Goetz, T., Frenzel, A. C., Barchfeld, P. & Perry, R. P. Measuring emotions in students’ learning and performance: The achievement emotions questionnaire (aeq). Contemporary educational psychology 36, 36–48 (2011)

[6] Goldberg, P. et al. Attentive or not? toward a machine learning approach to assessing students’ visible engagement in classroom instruction. Educational Psychology Review 33, 27–49 (2021)

[7] Liu, T., Ungar, L. & Kording, K. Quantifying causality in data science with quasi-experiments. Nature computational science 1, 24–32 (2021)

[8] Ouyang, F., Wu, M., Zheng, L., Zhang, L. & Jiao, P. Integration of artificial intelligence performance prediction and learning analytics to improve student learning in online engineering course. International Journal of Educational Technology in Higher Education 20, 4 (2023)

[9] Chen, J., Chen, Y., Ou, R., Wang, J. & Chen, Q. How to use artificial intelligence to improve entrepreneurial attitude in business simulation games: implications from a quasi-experiment. Frontiers in Psychology 13, 856085 (2022)

[10] Marquez-Carpintero, L. et al. Author spotlight: Addressing technical and subjective challenges in measuring classroom attention. JoVE (Journal of Visualized Experiments) e65931 (2023)

[11] Cobo, A., Valle, R., Buenaposada, J. M. & Baumela, L. On the representation and methodology for wide and short range head pose estimation. Pattern Recognition 149, 110263 (2024)

[12] Kuprashevich, M. & Tolstykh, I. Mivolo: Multi-input transformer for age and gender estimation. arXiv preprint arXiv:2307.04616 (2023)

[13] Zhang, F. et al. Mediapipe hands: On-device real-time hand tracking. arXiv preprint arXiv:2006.10214 (2020)

[14] Grishchenko, I., Ablavatski, A., Kartynnik, Y., Raveendran, K. & Grundmann, M. Attention mesh: High-fidelity face mesh prediction in real-time. arXiv preprint arXiv:2006.10962 (2020)

[15] Bazarevsky, V. et al. Blazepose: On-device real-time body pose tracking. arXiv preprint arXiv:2006.10204 (2020)

[16] Serengil, S. I. & Ozpinar, A. Hyperextended lightface: A facial attribute analysis framework. In 2021 International Conference on Engineering and Emerging Technologies (ICEET), 1–4, doi:10.1109/ICEET53442.2021.9659697 (IEEE, 2021)

[17] Belen, D. How cranial shapes led to contemporary ethnic classification: a historical view. Turkish Neurosurgery 28, 490–494 (2018)

[18] Goodyear, M. D., Krleza-Jeric, K. & Lemmens, T. The declaration of helsinki (2007)

[19] Singh, M. et al. Do i have your attention: A large scale engagement prediction dataset and baselines. In Proceedings of the 25th International Conference on Multimodal Interaction, 174–182 (2023)

[20] Wang, S. et al. A natural visible and infrared facial expression database for expression recognition and emotion inference. IEEE Transactions on Multimedia 12, 682–691 (2010)

[21] Delgado, K. et al. Student engagement dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 3628–3636 (2021)

[22] Whitehill, J., Serpell, Z., Lin, Y.-C., Foster, A. & Movellan, J. R. The faces of engagement: Automatic recognition of student engagement from facial expressions. IEEE Transactions on Affective Computing 5, 86–98 (2014)

Acknowledgements

This project has been developed under the framework of the CIPROM/2021/17 Prometeo project entitled "Meebai: Una metodología para la educación consciente de las emociones basada en la inteligencia artificial" ...
