pith. machine review for the scientific record.

arxiv: 2604.03317 · v2 · submitted 2026-04-01 · 💻 cs.CV

Recognition: no theorem link

Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords gaze detection · collaborative learning · pretrained models · face-to-face interaction · video analysis · student behavior · AI in education · scalable detection

The pith

Pretrained models detect student gaze behaviors in collaborative learning videos without any human labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the difficulty of automatically reading where students look during in-person group work, information that can help them reflect on how they collaborate. Past machine learning systems for this task demanded large amounts of hand-labeled video, which is expensive and hard to scale across different classrooms. The authors instead link three existing pretrained models: one tracks people, another finds objects such as laptops using text prompts, and a third estimates where each person is looking. This combination produces an overall F1 score of 0.829, performs best on gazes toward peers or laptops, and holds up more steadily than supervised alternatives when the room layout or camera angle changes. If the method works broadly, schools could analyze group dynamics at scale and give students feedback without extra annotation work.

Core claim

The proposed scalable AI approach leverages pretrained YOLO11 for person tracking, YOLOE-26 with text-prompt capability for education-related object detection, and the Gaze-LLE model for gaze target prediction to automatically detect gaze behaviours in face-to-face collaborative learning contexts without requiring human-annotated data. It achieves an F1-score of 0.829, with strong performance for laptop-directed and peer-directed gaze, and shows superior cross-configuration robustness compared to supervised machine learning approaches.

What carries the argument

A zero-shot pipeline that chains pretrained person tracking, text-prompt object detection, and gaze target prediction models applied directly to classroom video.
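A minimal sketch of how such a chain could be wired up, for orientation only. The tracking call uses Ultralytics' YOLO11 interface; the text-prompt step uses YOLO-World's set_classes API as a stand-in for YOLOE-26; the Gaze-LLE call is a hypothetical wrapper; and the gaze-point-in-box assignment rule is an assumption, since the review does not reproduce the authors' integration code.

```python
# Sketch of the chained zero-shot pipeline, not the authors' code.
# Assumptions: YOLO-World stands in for YOLOE-26; gaze_model.predict_gaze_point()
# is a hypothetical Gaze-LLE wrapper; gaze targets are assigned by checking
# whether the predicted gaze point falls inside a detected box.
from ultralytics import YOLO, YOLOWorld

person_tracker = YOLO("yolo11n.pt")               # pretrained person detector/tracker
object_detector = YOLOWorld("yolov8s-world.pt")   # text-promptable detector (stand-in for YOLOE-26)
object_detector.set_classes(["laptop", "whiteboard", "worksheet"])  # education-related prompts


def gaze_targets_for_frame(frame, gaze_model):
    """Return {track_id: gaze_target_label} for one video frame."""
    people = person_tracker.track(frame, persist=True, classes=[0], verbose=False)[0]
    objects = object_detector(frame, verbose=False)[0]
    labels = {}
    if people.boxes.id is None:                   # tracker may not assign IDs on the first frame
        return labels
    person_boxes = people.boxes.xyxy.tolist()
    for pid, pbox in zip(people.boxes.id.tolist(), person_boxes):
        gx, gy = gaze_model.predict_gaze_point(frame, head_box=pbox)  # hypothetical Gaze-LLE wrapper
        labels[int(pid)] = "other"
        candidates = list(zip(objects.boxes.xyxy.tolist(),
                              (objects.names[int(c)] for c in objects.boxes.cls.tolist())))
        candidates += [(b, "peer") for b in person_boxes if b is not pbox]
        for (x1, y1, x2, y2), name in candidates:
            if x1 <= gx <= x2 and y1 <= gy <= y2:
                labels[int(pid)] = name
                break
    return labels
```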

Load-bearing premise

The pretrained models for tracking people, spotting objects, and estimating gaze targets already work accurately on typical face-to-face collaborative learning videos without any fine-tuning.

What would settle it

Running the same pipeline on a fresh set of classroom videos recorded under different lighting, seating, or camera angles and obtaining an F1 score below 0.7 would show that the models do not generalize as claimed.
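A minimal sketch of that falsification test, assuming frame-level ground-truth and predicted gaze targets are available for the fresh videos. The five target classes are taken from the rebuttal below, and macro averaging is an assumption, since the averaging scheme behind the reported 0.829 is not stated here.

```python
# Generalisation check: score the unchanged pipeline on new classroom videos
# and compare against the 0.7 threshold named above.
from sklearn.metrics import f1_score

GAZE_TARGETS = ["laptop", "peer", "teacher", "whiteboard", "other"]  # classes assumed from the rebuttal

def generalisation_check(y_true, y_pred, threshold=0.7):
    """y_true, y_pred: frame-level gaze-target labels from the fresh video set."""
    score = f1_score(y_true, y_pred, labels=GAZE_TARGETS, average="macro")
    verdict = "does not generalise as claimed" if score < threshold else "consistent with the claim"
    return score, verdict
```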

read the original abstract

Previous studies have illustrated the potential of analysing gaze behaviours in collaborative learning to provide educationally meaningful information for students to reflect on their learning. Over the past decades, machine learning approaches have been developed to automatically detect gaze behaviours from video data. Yet, since these approaches often require large amounts of labelled data for training, human annotation remains necessary. Additionally, researchers have questioned the cross-configuration robustness of machine learning models developed, as training datasets often fail to encompass the full range of situations encountered in educational contexts. To address these challenges, this study proposes a scalable artificial intelligence approach that leverages pretrained and foundation models to automatically detect gaze behaviours in face-to-face collaborative learning contexts without requiring human-annotated data. The approach utilises pretrained YOLO11 for person tracking, YOLOE-26 with text-prompt capability for education-related object detection, and the Gaze-LLE model for gaze target prediction. The results indicate that the proposed approach achieves an F1-score of 0.829 in detecting students' gaze behaviours from video data, with strong performance for laptop-directed gaze and peer-directed gaze, yet weaker performance for other gaze targets. Furthermore, when compared to other supervised machine learning approaches, the proposed method demonstrates superior and more stable performance in complex contexts, highlighting its better cross-configuration robustness. The implications of this approach for supporting students' collaborative learning in real-world environments are also discussed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes a scalable AI pipeline that combines pretrained YOLO11 for person tracking, YOLOE-26 (text-prompt) for education-related object detection, and Gaze-LLE for gaze-target prediction to label students' gaze behaviors in face-to-face collaborative-learning videos without any human annotation or fine-tuning. It reports an overall F1-score of 0.829, with strong results for laptop- and peer-directed gaze and weaker results for other targets, and claims that the method outperforms and is more stable than supervised baselines in complex settings, thereby offering better cross-configuration robustness.

Significance. If the empirical claims are substantiated with full methodological detail, the work would be significant for educational technology: it would demonstrate a practical route to annotation-free gaze analysis at scale, removing a major barrier to studying collaborative learning in authentic classroom environments and potentially enabling larger, more diverse datasets for reflection tools.

major comments (3)
  1. [Abstract] Abstract: the headline F1-score of 0.829 is presented without any information on dataset size (number of videos, students, frames), how the five gaze targets were defined and annotated for ground truth, the identity or training details of the supervised baselines, or any error analysis. These omissions make it impossible to judge whether the reported performance and stability advantage are supported by the data.
  2. [Method] Method / Evaluation: the central claim of scalable, annotation-free performance rests on the untested assumption that YOLO11, YOLOE-26, and Gaze-LLE generalize zero-shot to multi-student classroom footage. No domain-gap quantification, failure-case analysis, or ablation on camera angle/lighting/occlusion differences is provided, which directly undermines both the F1 result and the cross-configuration robustness assertion.
  3. [Results] Results: the statement that performance is 'weaker for other gaze targets' is left unquantified and without per-class metrics or confusion-matrix details, preventing assessment of whether the overall 0.829 F1 is driven by a few dominant classes or truly reflects robust behavior across contexts.
minor comments (1)
  1. Define 'complex contexts' operationally (e.g., number of students, camera distance, lighting variability) so that the stability claim can be reproduced.
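One way to make that operational definition reproducible is to report the score stratified by recording configuration. A minimal sketch follows; the configuration fields (group size, camera distance, lighting) are the referee's illustrative examples, not quantities reported in the paper.

```python
# Per-configuration robustness report: macro F1 broken down by the
# configuration variables the referee suggests. Column names are illustrative.
import pandas as pd
from sklearn.metrics import f1_score

def f1_by_configuration(frames: pd.DataFrame) -> pd.DataFrame:
    """frames needs columns: y_true, y_pred, group_size, camera_distance, lighting."""
    rows = []
    for keys, grp in frames.groupby(["group_size", "camera_distance", "lighting"]):
        rows.append({"group_size": keys[0], "camera_distance": keys[1], "lighting": keys[2],
                     "n_frames": len(grp),
                     "macro_f1": f1_score(grp["y_true"], grp["y_pred"], average="macro")})
    # The spread of macro_f1 across rows is the operational "stability" figure.
    return pd.DataFrame(rows).sort_values("macro_f1")
```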

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that highlight opportunities to improve the clarity and completeness of our manuscript. We address each major comment point by point below, providing clarifications based on the existing work and indicating where revisions will strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline F1-score of 0.829 is presented without any information on dataset size (number of videos, students, frames), how the five gaze targets were defined and annotated for ground truth, the identity or training details of the supervised baselines, or any error analysis. These omissions make it impossible to judge whether the reported performance and stability advantage are supported by the data.

    Authors: We agree that the abstract would benefit from additional context. In the revised version, we will expand the abstract to specify the dataset composition (number of videos, students, and frames), briefly define the five gaze targets (laptop, peer, teacher, whiteboard, and other), note that ground-truth labels were obtained via human annotation solely for evaluation, identify the supervised baselines (including their architectures and training regimes), and reference the error analysis already present in the Results section. These details are fully elaborated in the Methods and Results sections. revision: yes

  2. Referee: [Method] Method / Evaluation: the central claim of scalable, annotation-free performance rests on the untested assumption that YOLO11, YOLOE-26, and Gaze-LLE generalize zero-shot to multi-student classroom footage. No domain-gap quantification, failure-case analysis, or ablation on camera angle/lighting/occlusion differences is provided, which directly undermines both the F1 result and the cross-configuration robustness assertion.

    Authors: The pipeline is explicitly zero-shot: the three models are applied in their pretrained form with no fine-tuning or target-domain labels. To address the concern, we will add a new subsection discussing observed domain gaps, including qualitative failure-case analysis (e.g., instances of heavy occlusion or extreme lighting) drawn from the collected classroom videos, and note the range of camera angles and configurations present in the dataset. While a controlled quantitative ablation would require new experiments beyond the current scope, the multi-configuration results already demonstrate stability; we will make this evidence more explicit and add a limitations paragraph on remaining generalization questions. revision: partial

  3. Referee: [Results] Results: the statement that performance is 'weaker for other gaze targets' is left unquantified and without per-class metrics or confusion-matrix details, preventing assessment of whether the overall 0.829 F1 is driven by a few dominant classes or truly reflects robust behavior across contexts.

    Authors: We accept this criticism and will include a table of per-class F1-scores together with a confusion matrix in the revised Results section. This will quantify the contribution of each gaze target to the aggregate score and allow direct evaluation of robustness across all classes. revision: yes
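The promised per-class table and confusion matrix can be produced directly from frame-level labels. A minimal sketch with scikit-learn, again assuming the five gaze targets listed above:

```python
# Per-target F1 plus confusion matrix, as promised in the rebuttal.
from sklearn.metrics import classification_report, confusion_matrix

GAZE_TARGETS = ["laptop", "peer", "teacher", "whiteboard", "other"]

def per_class_breakdown(y_true, y_pred):
    report = classification_report(y_true, y_pred, labels=GAZE_TARGETS, digits=3)
    cm = confusion_matrix(y_true, y_pred, labels=GAZE_TARGETS)  # rows: true target, cols: predicted
    return report, cm
```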

Circularity Check

0 steps flagged

No circularity: empirical zero-shot application of external pretrained models

full rationale

The paper reports an empirical pipeline that applies three externally pretrained models (YOLO11, YOLOE-26, Gaze-LLE) to classroom video and measures F1 on held-out clips. No equations, parameter fitting, or self-citation chains appear in the provided text. The 0.829 F1 and stability claims are direct evaluation outputs, not quantities forced by construction from the inputs. The zero-shot transfer assumption is a methodological premise whose validity is testable against external benchmarks and does not reduce the reported metrics to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that off-the-shelf pretrained models transfer directly to this educational video domain without adaptation or additional training data.

axioms (1)
  • domain assumption Pretrained computer vision models (YOLO11, YOLOE-26, Gaze-LLE) can be applied directly to face-to-face collaborative learning videos to accurately detect gaze targets without fine-tuning or labeled data.
    Invoked in the description of the proposed approach that relies on these models for tracking, object detection, and gaze prediction.

pith-pipeline@v0.9.0 · 5566 in / 1383 out tokens · 46399 ms · 2026-05-13T23:13:27.626814+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

  1. [1]

    Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning Junyuan Liang1 [0000-0001-9119-2885], Qi Zhou1 [0000-0002-4694-4598], Sahan Bulathwela2 [0000-0002-5878-2143], and Mutlu Cukurova1 2 [0000-0001-5843-4854] 1 UCL Knowledge Lab, University College London, London, UK 2 UCL Center for Artificial Intell...

  2. [2]

    Computers & Education

    Cukurova, M., Luckin, R., Millán, E., Mavrikis, M.: The NISPI framework: Analysing collaborative problem-solving from students’ physical interactions. Computers & Education. 116, 93–109 (2018). https://doi.org/10.1016/j.compedu.2017.08.007

  3. [3]

    Computer Assisted Learning

    Spikol, D., Ruffaldi, E., Dabisias, G., Cukurova, M.: Supervised machine learning in multimodal learning analytics for estimating success in project-based learning. Computer Assisted Learning. 34, 366–377 (2018). https://doi.org/10.1111/jcal.12263

  4. [4]

    In: Responsive and Sustainable Educational Futures

    Zhou, Q., Bhattacharya, A., Suraworachet, W., Nagahara, H., Cukurova, M.: Automated Detection of Students’ Gaze Interactions in Collaborative Learning Videos: A Novel Approach. In: Viberg, O., Jivet, I., Muñoz-Merino, P.J., Perifanou, M., and Papathoma, T. (eds.) Responsive and Sustainable Educational Futures. pp. 504–517. Springer Nature Switzerland, C...

  5. [5]

    Educ Inf Technol

    Zhou, Q., Suraworachet, W., Cukurova, M.: Detecting non-verbal speech and gaze behaviours with multimodal data and computer vision to interpret effective collaborative learning interactions. Educ Inf Technol. 29, 1071–1098 (2024). https://doi.org/10.1007/s10639-023-12315-1

  6. [6]

    In: Artificial Intelligence in Education

    Zhou, Q., Suraworachet, W., Celiktutan, O., Cukurova, M.: What Does Shared Understanding in Students’ Face-to-Face Collaborative Learning Gaze Behaviours “Look Like”? In: Rodrigo, M.M., Matsuda, N., Cristea, A.I., and Dimitrova, V. (eds.) Artificial Intelligence in Education. pp. 588–593. Springer International Publishing, Cham (2022). https://doi.org/10...

  7. [7]

    Schneider, B., Pea, R.: Real-time mutual gaze perception enhances collaborative learning and collaboration quality. Intern. J. Comput.-Support. Collab. Learn. 8, 375–397 (2013). https://doi.org/10.1007/s11412-013-9181-4

  8. [8]

    In: Innovative Assessment of Collaboration

    Olsen, J.K., Aleven, V., Rummel, N.: Exploring Dual Eye Tracking as a Tool to Assess Collaboration. In: Von Davier, A.A., Zhu, M., and Kyllonen, P.C. (eds.) Innovative Assessment of Collaboration. pp. 157–172. Springer International Publishing, Cham (2017). https://doi.org/10.1007/978-3-319-33261-1_10

  9. [9]

    Liu, Q., Jiang, X., Jiang, R.: Classroom Behavior Recognition Using Computer Vision: A Systematic Review. Sensors. 25, 373 (2025). https://doi.org/10.3390/s25020373

  10. [10]

    Carroll, M., Ruble, M., Dranias, M., Rebensky, S., Chaparro, M., Chiang, J., Winslow, B.: Automatic Detection of Learner Engagement Using Machine Learning and Wearable Sensors. JBBS. 10, 165–178 (2020). https://doi.org/10.4236/jbbs.2020.103010

  11. [11]

    Schneider, B., Sharma, K., Cuendet, S., Zufferey, G., Dillenbourg, P., Pea, R.: Leveraging mobile eye-trackers to capture joint visual attention in co-located collaborative learning groups. Intern. J. Comput.-Support. Collab. Learn. 13, 241–261 (2018). https://doi.org/10.1007/s11412-018-9281-2

  12. [12]

    In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    Krafka, K., Khosla, A., Kellnhofer, P., Kannan, H., Bhandarkar, S., Matusik, W., Torralba, A.: Eye Tracking for Everyone. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2176–2184. IEEE, Las Vegas, NV, USA (2016). https://doi.org/10.1109/CVPR.2016.239

  13. [13]

    https://doi.org/10.48550/ARXIV.2508.15782

    Penchala, S., Kontham, S.R., Bhattacharjee, P., Mahmoodi, N., Fonseca, D., Karami, S., Ghahremani, M., Perkins, A.D., Rahimi, S., Golilarz, N.A.: Learning in Focus: Detecting Behavioral and Collaborative Engagement Using Vision Transformers, https://arxiv.org/abs/2508.15782, (2025). https://doi.org/10.48550/ARXIV.2508.15782

  14. [14]

    Machine Vision and Applications

    Huang, Q., Veeraraghavan, A., Sabharwal, A.: TabletGaze: dataset and analysis for unconstrained appearance-based gaze estimation in mobile tablets. Machine Vision and Applications. 28, 445–461 (2017). https://doi.org/10.1007/s00138-017-0852-4

  15. [15]

    In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chong, E., Wang, Y., Ruiz, N., Rehg, J.M.: Detecting Attended Visual Targets in Video. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5395–5405. IEEE, Seattle, WA, USA (2020). https://doi.org/10.1109/CVPR42600.2020.00544

  16. [16]

    https://doi.org/10.48550/ARXIV.2509.25164

    Sapkota, R., Cheppally, R.H., Sharda, A., Karkee, M.: YOLO26: Key Architectural Enhancements and Performance Benchmarking for Real-Time Object Detection, https://arxiv.org/abs/2509.25164, (2025). https://doi.org/10.48550/ARXIV.2509.25164

  17. [17]

    YOLOv11: An Overview of the Key Architectural Enhancements

    Khanam, R., Hussain, M.: YOLOv11: An Overview of the Key Architectural Enhancements, https://arxiv.org/abs/2410.17725, (2024). https://doi.org/10.48550/ARXIV.2410.17725

  18. [18]

    https://doi.org/10.48550/ARXIV.2412.09586

    Ryan, F., Bati, A., Lee, S., Bolya, D., Hoffman, J., Rehg, J.M.: Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders, https://arxiv.org/abs/2412.09586, (2024). https://doi.org/10.48550/ARXIV.2412.09586

  19. [19]

    The Journal of Machine Learning Research

    Fernandez-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research. 15, 3133–3181 (2014)

  20. [20]

    In: Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

    Yacouby, R., Axman, D.: Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models. In: Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems. pp. 79–91. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.eval4nlp-1.9

  21. [21]

    Rainio, O., Teuho, J., Klén, R.: Evaluation metrics and statistical tests for machine learning. Sci Rep. 14, 6086 (2024). https://doi.org/10.1038/s41598-024-56706-x

  22. [22]

    In: Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications

    Müller, P., Huang, M.X., Zhang, X., Bulling, A.: Robust eye contact detection in natural multi-person interactions using gaze and speaking behaviour. In: Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications. pp. 1–10. ACM, Warsaw, Poland (2018). https://doi.org/10...

  23. [23]

    https://doi.org/10.1007/978-3-031-26316-3_4

    13844, 52–70 (2023). https://doi.org/10.1007/978-3-031-26316-3_4

  24. [24]

    IEEE Trans. Neural Netw.

    Stiefelhagen, R., Jie Yang, Waibel, A.: Modeling focus of attention for meeting indexing based on multiple cues. IEEE Trans. Neural Netw. 13, 928–938 (2002). https://doi.org/10.1109/TNN.2002.1021893

  25. [25]

    Applied Sciences

    Rodríguez-Ortiz, M.Á., Santana-Mancilla, P.C., Anido-Rifón, L.E.: Machine Learning and Generative AI in Learning Analytics for Higher Education: A Systematic Review of Models, Trends, and Challenges. Applied Sciences. 15, 8679 (2025). https://doi.org/10.3390/app15158679

  26. [26]

    Computers and Education Open

    Mathrani, A., Susnjak, T., Ramaswami, G., Barczak, A.: Perspectives on the challenges of generalizability, transparency and ethics in predictive learning analytics. Computers and Education Open. 2, 100060 (2021). https://doi.org/10.1016/j.caeo.2021.100060

  27. [27]

    In: LAK22: 12th International Learning Analytics and Knowledge Conference

    Yan, L., Zhao, L., Gasevic, D., Martinez-Maldonado, R.: Scalability, Sustainability, and Ethicality of Multimodal Learning Analytics. In: LAK22: 12th International Learning Analytics and Knowledge Conference. pp. 13–23. ACM, Online USA (2022). https://doi.org/10.1145/3506860.3506862

  28. [28]

    Computer Assisted Learning

    Suraworachet, W., Zhou, Q., Cukurova, M.: University Students’ Perceptions of a Multimodal AI System for Real-World Collaboration Analytics: Lessons Learned from a Case Study. Computer Assisted Learning. 41, e70103 (2025). https://doi.org/10.1111/jcal.70103

  29. [29]

    Learning Analytics

    Whitehead, R., Nguyen, A., Järvelä, S.: Utilizing Multimodal Large Language Models for Video Analysis of Posture in Studying Collaborative Learning: A Case Study. Learning Analytics. 12, 186–200 (2025). https://doi.org/10.18608/jla.2025.8595