Recognition: no theorem link
Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning
Pith reviewed 2026-05-13 23:13 UTC · model grok-4.3
The pith
Pretrained models detect student gaze behaviors in collaborative learning videos without any human labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The proposed scalable AI approach chains pretrained YOLO11 for person tracking, YOLOE-26 with text-prompt capability for education-related object detection, and the Gaze-LLE model for gaze target prediction to automatically detect gaze behaviours in face-to-face collaborative learning contexts without requiring human-annotated data. It achieves an F1-score of 0.829, with strong performance for laptop-directed and peer-directed gaze, and shows superior cross-configuration robustness compared to supervised machine learning approaches.
What carries the argument
A zero-shot pipeline that chains pretrained person tracking, text-prompt object detection, and gaze target prediction models applied directly to classroom video.
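As an illustration of what such a chain involves, here is a minimal sketch assuming the Ultralytics package for YOLO11/YOLOE and the public Gaze-LLE torch.hub entry point (fkryan/gazelle); the checkpoint names, text prompts, input format, and frame path are illustrative assumptions, not the paper's reported configuration (the paper cites a YOLOE-26 variant whose exact weights are not given here).

```python
# Sketch of the zero-shot chain: person tracking -> text-prompted object
# detection -> gaze target prediction. Checkpoints, prompts, and the frame
# path are illustrative assumptions, not the paper's exact configuration.
import torch
from PIL import Image
from ultralytics import YOLO, YOLOE

FRAME = "classroom_frame.jpg"  # hypothetical frame extracted from a classroom video

# 1) Track people with pretrained YOLO11 (COCO class 0 = person; persist keeps track IDs).
tracker = YOLO("yolo11n.pt")
people = tracker.track(FRAME, classes=[0], persist=True, verbose=False)[0]

# 2) Detect education-related objects with YOLOE via text prompts (no task-specific labels).
detector = YOLOE("yoloe-11s-seg.pt")
prompts = ["laptop", "whiteboard", "worksheet"]  # illustrative prompt set
detector.set_classes(prompts, detector.get_text_pe(prompts))
objects = detector.predict(FRAME, verbose=False)[0]  # boxes feed the gaze-target assignment step

# 3) Predict each person's gaze heatmap with Gaze-LLE (input/output format assumed from the
# public repo: normalised head boxes per image; person boxes are used as a proxy here).
gazelle, gazelle_transform = torch.hub.load("fkryan/gazelle", "gazelle_dinov2_vitb14")
img = Image.open(FRAME).convert("RGB")
w, h = img.size
boxes = [(x1 / w, y1 / h, x2 / w, y2 / h) for x1, y1, x2, y2 in people.boxes.xyxy.tolist()]
with torch.no_grad():
    out = gazelle({"images": gazelle_transform(img).unsqueeze(0), "bboxes": [boxes]})
heatmaps = out["heatmap"][0]  # one gaze heatmap per tracked person (assumed output key)
```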
Load-bearing premise
The pretrained models for tracking people, spotting objects, and estimating gaze targets already work accurately on typical face-to-face collaborative learning videos without any fine-tuning.
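Even if this premise holds, the raw outputs are a gaze heatmap per student and a set of object boxes; a further assignment rule is needed to produce a categorical gaze behaviour. One plausible rule, assumed here rather than taken from the paper, labels each frame by whichever detected box contains the heatmap peak:

```python
# Hypothetical assignment of a categorical gaze-behaviour label from a gaze
# heatmap and detected object boxes; the paper's exact rule is not stated in
# the excerpt, so this peak-in-box heuristic is an assumption.
import numpy as np

def gaze_label(heatmap, object_boxes):
    """heatmap: 2-D array over the frame; object_boxes: {label: [(x1, y1, x2, y2), ...]} in normalised coords."""
    iy, ix = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    gx, gy = ix / heatmap.shape[1], iy / heatmap.shape[0]  # heatmap peak as the estimated gaze point
    for label, boxes in object_boxes.items():
        if any(x1 <= gx <= x2 and y1 <= gy <= y2 for x1, y1, x2, y2 in boxes):
            return label
    return "other"  # no detected target contains the gaze point

# Toy example: a peak inside the laptop box yields a laptop-directed gaze label.
hm = np.zeros((64, 64)); hm[40, 20] = 1.0
print(gaze_label(hm, {"laptop": [(0.2, 0.5, 0.45, 0.8)], "peer": [(0.6, 0.1, 0.9, 0.7)]}))  # -> laptop
```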
What would settle it
Running the same pipeline on a fresh set of classroom videos recorded under different lighting, seating, or camera angles and obtaining an F1 score below 0.7 would show that the models do not generalize as claimed.
original abstract
Previous studies have illustrated the potential of analysing gaze behaviours in collaborative learning to provide educationally meaningful information for students to reflect on their learning. Over the past decades, machine learning approaches have been developed to automatically detect gaze behaviours from video data. Yet, since these approaches often require large amounts of labelled data for training, human annotation remains necessary. Additionally, researchers have questioned the cross-configuration robustness of machine learning models developed, as training datasets often fail to encompass the full range of situations encountered in educational contexts. To address these challenges, this study proposes a scalable artificial intelligence approach that leverages pretrained and foundation models to automatically detect gaze behaviours in face-to-face collaborative learning contexts without requiring human-annotated data. The approach utilises pretrained YOLO11 for person tracking, YOLOE-26 with text-prompt capability for education-related object detection, and the Gaze-LLE model for gaze target prediction. The results indicate that the proposed approach achieves an F1-score of 0.829 in detecting students' gaze behaviours from video data, with strong performance for laptop-directed gaze and peer-directed gaze, yet weaker performance for other gaze targets. Furthermore, when compared to other supervised machine learning approaches, the proposed method demonstrates superior and more stable performance in complex contexts, highlighting its better cross-configuration robustness. The implications of this approach for supporting students' collaborative learning in real-world environments are also discussed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a scalable AI pipeline that combines pretrained YOLO11 for person tracking, YOLOE-26 (text-prompt) for education-related object detection, and Gaze-LLE for gaze-target prediction to label students' gaze behaviors in face-to-face collaborative-learning videos without any human annotation or fine-tuning. It reports an overall F1-score of 0.829, with strong results for laptop- and peer-directed gaze and weaker results for other targets, and claims that the method outperforms and is more stable than supervised baselines in complex settings, thereby offering better cross-configuration robustness.
Significance. If the empirical claims are substantiated with full methodological detail, the work would be significant for educational technology: it would demonstrate a practical route to annotation-free gaze analysis at scale, removing a major barrier to studying collaborative learning in authentic classroom environments and potentially enabling larger, more diverse datasets for reflection tools.
major comments (3)
- [Abstract] Abstract: the headline F1-score of 0.829 is presented without any information on dataset size (number of videos, students, frames), how the five gaze targets were defined and annotated for ground truth, the identity or training details of the supervised baselines, or any error analysis. These omissions make it impossible to judge whether the reported performance and stability advantage are supported by the data.
- [Method] Method / Evaluation: the central claim of scalable, annotation-free performance rests on the untested assumption that YOLO11, YOLOE-26, and Gaze-LLE generalize zero-shot to multi-student classroom footage. No domain-gap quantification, failure-case analysis, or ablation on camera angle/lighting/occlusion differences is provided, which directly undermines both the F1 result and the cross-configuration robustness assertion.
- [Results] Results: the statement that performance is 'weaker for other gaze targets' is left unquantified and without per-class metrics or confusion-matrix details, preventing assessment of whether the overall 0.829 F1 is driven by a few dominant classes or truly reflects robust behavior across contexts.
minor comments (1)
- Define 'complex contexts' operationally (e.g., number of students, camera distance, lighting variability) so that the stability claim can be reproduced.
Simulated Author's Rebuttal
We thank the referee for the constructive comments that highlight opportunities to improve the clarity and completeness of our manuscript. We address each major comment point by point below, providing clarifications based on the existing work and indicating where revisions will strengthen the paper.
point-by-point responses
-
Referee: [Abstract] Abstract: the headline F1-score of 0.829 is presented without any information on dataset size (number of videos, students, frames), how the five gaze targets were defined and annotated for ground truth, the identity or training details of the supervised baselines, or any error analysis. These omissions make it impossible to judge whether the reported performance and stability advantage are supported by the data.
Authors: We agree that the abstract would benefit from additional context. In the revised version, we will expand the abstract to specify the dataset composition (number of videos, students, and frames), briefly define the five gaze targets (laptop, peer, teacher, whiteboard, and other), note that ground-truth labels were obtained via human annotation solely for evaluation, identify the supervised baselines (including their architectures and training regimes), and reference the error analysis already present in the Results section. These details are fully elaborated in the Methods and Results sections. revision: yes
-
Referee: [Method] Method / Evaluation: the central claim of scalable, annotation-free performance rests on the untested assumption that YOLO11, YOLOE-26, and Gaze-LLE generalize zero-shot to multi-student classroom footage. No domain-gap quantification, failure-case analysis, or ablation on camera angle/lighting/occlusion differences is provided, which directly undermines both the F1 result and the cross-configuration robustness assertion.
Authors: The pipeline is explicitly zero-shot: the three models are applied in their pretrained form with no fine-tuning or target-domain labels. To address the concern, we will add a new subsection discussing observed domain gaps, including qualitative failure-case analysis (e.g., instances of heavy occlusion or extreme lighting) drawn from the collected classroom videos, and note the range of camera angles and configurations present in the dataset. While a controlled quantitative ablation would require new experiments beyond the current scope, the multi-configuration results already demonstrate stability; we will make this evidence more explicit and add a limitations paragraph on remaining generalization questions. revision: partial
-
Referee: [Results] Results: the statement that performance is 'weaker for other gaze targets' is left unquantified and without per-class metrics or confusion-matrix details, preventing assessment of whether the overall 0.829 F1 is driven by a few dominant classes or truly reflects robust behavior across contexts.
Authors: We accept this criticism and will include a table of per-class F1-scores together with a confusion matrix in the revised Results section. This will quantify the contribution of each gaze target to the aggregate score and allow direct evaluation of robustness across all classes. revision: yes
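To make the promised per-class reporting concrete, a minimal sketch with scikit-learn is shown below; the five labels follow the rebuttal above, while the toy y_true/y_pred sequences are purely illustrative and not the paper's data.

```python
# Sketch of the per-class F1 table and confusion matrix promised in the
# rebuttal; labels follow the rebuttal's gaze-target list, data are toy values.
from sklearn.metrics import classification_report, confusion_matrix

TARGETS = ["laptop", "peer", "teacher", "whiteboard", "other"]
y_true = ["laptop", "peer", "laptop", "other", "whiteboard", "peer"]  # illustrative ground truth
y_pred = ["laptop", "peer", "laptop", "peer", "other", "peer"]        # illustrative pipeline output

# Per-class precision, recall, and F1, plus macro/weighted averages.
print(classification_report(y_true, y_pred, labels=TARGETS, zero_division=0))
# Row i, column j: frames with true target TARGETS[i] predicted as TARGETS[j].
print(confusion_matrix(y_true, y_pred, labels=TARGETS))
```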
Circularity Check
No circularity: empirical zero-shot application of external pretrained models
full rationale
The paper reports an empirical pipeline that applies three externally pretrained models (YOLO11, YOLOE-26, Gaze-LLE) to classroom video and measures F1 on held-out clips. No equations, parameter fitting, or self-citation chains appear in the provided text. The 0.829 F1 and stability claims are direct evaluation outputs, not quantities forced by construction from the inputs. The zero-shot transfer assumption is a methodological premise whose validity is testable against external benchmarks and does not reduce the reported metrics to tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pretrained computer vision models (YOLO11, YOLOE-26, Gaze-LLE) can be applied directly to face-to-face collaborative learning videos to accurately detect gaze targets without fine-tuning or labeled data.
Reference graph
Works this paper leans on
- [1] Liang, J., Zhou, Q., Bulathwela, S., Cukurova, M.: Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning (2026)
- [2] Cukurova, M., Luckin, R., Millán, E., Mavrikis, M.: The NISPI framework: Analysing collaborative problem-solving from students' physical interactions. Computers & Education. 116, 93–109 (2018). https://doi.org/10.1016/j.compedu.2017.08.007
- [3] Spikol, D., Ruffaldi, E., Dabisias, G., Cukurova, M.: Supervised machine learning in multimodal learning analytics for estimating success in project-based learning. Computer Assisted Learning. 34, 366–377 (2018). https://doi.org/10.1111/jcal.12263
- [4] Zhou, Q., Bhattacharya, A., Suraworachet, W., Nagahara, H., Cukurova, M.: Automated Detection of Students' Gaze Interactions in Collaborative Learning Videos: A Novel Approach. In: Viberg, O., Jivet, I., Muñoz-Merino, P.J., Perifanou, M., and Papathoma, T. (eds.) Responsive and Sustainable Educational Futures. pp. 504–517. Springer Nature Switzerland, C...
- [5] Zhou, Q., Suraworachet, W., Cukurova, M.: Detecting non-verbal speech and gaze behaviours with multimodal data and computer vision to interpret effective collaborative learning interactions. Educ Inf Technol. 29, 1071–1098 (2024). https://doi.org/10.1007/s10639-023-12315-1
- [6] Zhou, Q., Suraworachet, W., Celiktutan, O., Cukurova, M.: What Does Shared Understanding in Students' Face-to-Face Collaborative Learning Gaze Behaviours "Look Like"? In: Rodrigo, M.M., Matsuda, N., Cristea, A.I., and Dimitrova, V. (eds.) Artificial Intelligence in Education. pp. 588–593. Springer International Publishing, Cham (2022). https://doi.org/10...
- [7] Schneider, B., Pea, R.: Real-time mutual gaze perception enhances collaborative learning and collaboration quality. Intern. J. Comput.-Support. Collab. Learn. 8, 375–397 (2013). https://doi.org/10.1007/s11412-013-9181-4
- [8] Olsen, J.K., Aleven, V., Rummel, N.: Exploring Dual Eye Tracking as a Tool to Assess Collaboration. In: Von Davier, A.A., Zhu, M., and Kyllonen, P.C. (eds.) Innovative Assessment of Collaboration. pp. 157–172. Springer International Publishing, Cham (2017). https://doi.org/10.1007/978-3-319-33261-1_10
- [9] Liu, Q., Jiang, X., Jiang, R.: Classroom Behavior Recognition Using Computer Vision: A Systematic Review. Sensors. 25, 373 (2025). https://doi.org/10.3390/s25020373
- [10] Carroll, M., Ruble, M., Dranias, M., Rebensky, S., Chaparro, M., Chiang, J., Winslow, B.: Automatic Detection of Learner Engagement Using Machine Learning and Wearable Sensors. JBBS. 10, 165–178 (2020). https://doi.org/10.4236/jbbs.2020.103010
- [11] Schneider, B., Sharma, K., Cuendet, S., Zufferey, G., Dillenbourg, P., Pea, R.: Leveraging mobile eye-trackers to capture joint visual attention in co-located collaborative learning groups. Intern. J. Comput.-Support. Collab. Learn. 13, 241–261 (2018). https://doi.org/10.1007/s11412-018-9281-2
- [12] Krafka, K., Khosla, A., Kellnhofer, P., Kannan, H., Bhandarkar, S., Matusik, W., Torralba, A.: Eye Tracking for Everyone. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2176–2184. IEEE, Las Vegas, NV, USA (2016). https://doi.org/10.1109/CVPR.2016.239
- [13] Penchala, S., Kontham, S.R., Bhattacharjee, P., Mahmoodi, N., Fonseca, D., Karami, S., Ghahremani, M., Perkins, A.D., Rahimi, S., Golilarz, N.A.: Learning in Focus: Detecting Behavioral and Collaborative Engagement Using Vision Transformers, https://arxiv.org/abs/2508.15782, (2025). https://doi.org/10.48550/ARXIV.2508.15782
- [14] Huang, Q., Veeraraghavan, A., Sabharwal, A.: TabletGaze: dataset and analysis for unconstrained appearance-based gaze estimation in mobile tablets. Machine Vision and Applications. 28, 445–461 (2017). https://doi.org/10.1007/s00138-017-0852-4
- [15] Chong, E., Wang, Y., Ruiz, N., Rehg, J.M.: Detecting Attended Visual Targets in Video. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5395–5405. IEEE, Seattle, WA, USA (2020). https://doi.org/10.1109/CVPR42600.2020.00544
- [16] Sapkota, R., Cheppally, R.H., Sharda, A., Karkee, M.: YOLO26: Key Architectural Enhancements and Performance Benchmarking for Real-Time Object Detection, https://arxiv.org/abs/2509.25164, (2025). https://doi.org/10.48550/ARXIV.2509.25164
- [17] Khanam, R., Hussain, M.: YOLOv11: An Overview of the Key Architectural Enhancements, https://arxiv.org/abs/2410.17725, (2024). https://doi.org/10.48550/ARXIV.2410.17725
- [18] Ryan, F., Bati, A., Lee, S., Bolya, D., Hoffman, J., Rehg, J.M.: Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders, https://arxiv.org/abs/2412.09586, (2024). https://doi.org/10.48550/ARXIV.2412.09586
- [19] Fernandez-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research. 15, 3133–3181 (2014)
- [20] Yacouby, R., Axman, D.: Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models. In: Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems. pp. 79–91. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.eval4nlp-1.9
- [21] Rainio, O., Teuho, J., Klén, R.: Evaluation metrics and statistical tests for machine learning. Sci Rep. 14, 6086 (2024). https://doi.org/10.1038/s41598-024-56706-x
- [22] Müller, P., Huang, M.X., Zhang, X., Bulling, A.: Robust eye contact detection in natural multi-person interactions using gaze and speaking behaviour. In: Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications. pp. 1–10. ACM, Warsaw, Poland (2018). https://doi.org/10...
- [23] 13844, 52–70 (2023). https://doi.org/10.1007/978-3-031-26316-3_4
- [24] Stiefelhagen, R., Yang, J., Waibel, A.: Modeling focus of attention for meeting indexing based on multiple cues. IEEE Trans. Neural Netw. 13, 928–938 (2002). https://doi.org/10.1109/TNN.2002.1021893
- [25] Rodríguez-Ortiz, M.Á., Santana-Mancilla, P.C., Anido-Rifón, L.E.: Machine Learning and Generative AI in Learning Analytics for Higher Education: A Systematic Review of Models, Trends, and Challenges. Applied Sciences. 15, 8679 (2025). https://doi.org/10.3390/app15158679
- [26] Mathrani, A., Susnjak, T., Ramaswami, G., Barczak, A.: Perspectives on the challenges of generalizability, transparency and ethics in predictive learning analytics. Computers and Education Open. 2, 100060 (2021). https://doi.org/10.1016/j.caeo.2021.100060
- [27] Yan, L., Zhao, L., Gasevic, D., Martinez-Maldonado, R.: Scalability, Sustainability, and Ethicality of Multimodal Learning Analytics. In: LAK22: 12th International Learning Analytics and Knowledge Conference. pp. 13–23. ACM, Online USA (2022). https://doi.org/10.1145/3506860.3506862
- [28] Suraworachet, W., Zhou, Q., Cukurova, M.: University Students' Perceptions of a Multimodal AI System for Real-World Collaboration Analytics: Lessons Learned from a Case Study. Computer Assisted Learning. 41, e70103 (2025). https://doi.org/10.1111/jcal.70103
- [29] Whitehead, R., Nguyen, A., Järvelä, S.: Utilizing Multimodal Large Language Models for Video Analysis of Posture in Studying Collaborative Learning: A Case Study. Learning Analytics. 12, 186–200 (2025). https://doi.org/10.18608/jla.2025.8595