pith. machine review for the scientific record.

arxiv: 2604.03317 · v2 · submitted 2026-04-01 · 💻 cs.CV

Recognition: no theorem link

Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords gaze detection · collaborative learning · pretrained models · face-to-face interaction · video analysis · student behavior · AI in education · scalable detection

The pith

Pretrained models detect student gaze behaviors in collaborative learning videos without any human labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the difficulty of automatically reading where students look during in-person group work, information that can help them reflect on how they collaborate. Past machine learning systems for this task demanded large amounts of hand-labeled video, which is expensive and hard to scale across different classrooms. The authors instead link three existing pretrained models: one tracks people, another finds objects such as laptops using text prompts, and a third estimates where each person is looking. This combination produces an overall F1 score of 0.829, performs best on gazes toward peers or laptops, and holds up more steadily than supervised alternatives when the room layout or camera angle changes. If the method works broadly, schools could analyze group dynamics at scale and give students feedback without extra annotation work.

Core claim

The proposed scalable AI approach leverages pretrained YOLO11 for person tracking, YOLOE-26 with text-prompt capability for education-related object detection, and the Gaze-LLE model for gaze target prediction to automatically detect gaze behaviours in face-to-face collaborative learning contexts without requiring human-annotated data. It achieves an F1-score of 0.829, with strong performance for laptop-directed and peer-directed gaze, and shows superior cross-configuration robustness compared to supervised machine learning approaches.

What carries the argument

A zero-shot pipeline that chains pretrained person tracking, text-prompt object detection, and gaze target prediction models applied directly to classroom video.
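A minimal sketch of how such a chain could be wired up, for orientation only. The tracking call uses Ultralytics' YOLO11 interface; the text-prompt step uses YOLO-World's set_classes API as a stand-in for YOLOE-26; the Gaze-LLE call is a hypothetical wrapper; and the gaze-point-in-box assignment rule is an assumption, since the review does not reproduce the authors' integration code.

```python
# Sketch of the chained zero-shot pipeline, not the authors' code.
# Assumptions: YOLO-World stands in for YOLOE-26; gaze_model.predict_gaze_point()
# is a hypothetical Gaze-LLE wrapper; gaze targets are assigned by checking
# whether the predicted gaze point falls inside a detected box.
from ultralytics import YOLO, YOLOWorld

person_tracker = YOLO("yolo11n.pt")               # pretrained person detector/tracker
object_detector = YOLOWorld("yolov8s-world.pt")   # text-promptable detector (stand-in for YOLOE-26)
object_detector.set_classes(["laptop", "whiteboard", "worksheet"])  # education-related prompts


def gaze_targets_for_frame(frame, gaze_model):
    """Return {track_id: gaze_target_label} for one video frame."""
    people = person_tracker.track(frame, persist=True, classes=[0], verbose=False)[0]
    objects = object_detector(frame, verbose=False)[0]
    labels = {}
    if people.boxes.id is None:                   # tracker may not assign IDs on the first frame
        return labels
    person_boxes = people.boxes.xyxy.tolist()
    for pid, pbox in zip(people.boxes.id.tolist(), person_boxes):
        gx, gy = gaze_model.predict_gaze_point(frame, head_box=pbox)  # hypothetical Gaze-LLE wrapper
        labels[int(pid)] = "other"
        candidates = list(zip(objects.boxes.xyxy.tolist(),
                              (objects.names[int(c)] for c in objects.boxes.cls.tolist())))
        candidates += [(b, "peer") for b in person_boxes if b is not pbox]
        for (x1, y1, x2, y2), name in candidates:
            if x1 <= gx <= x2 and y1 <= gy <= y2:
                labels[int(pid)] = name
                break
    return labels
```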

Load-bearing premise

The pretrained models for tracking people, spotting objects, and estimating gaze targets already work accurately on typical face-to-face collaborative learning videos without any fine-tuning.

What would settle it

Running the same pipeline on a fresh set of classroom videos recorded under different lighting, seating, or camera angles and obtaining an F1 score below 0.7 would show that the models do not generalize as claimed.
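A minimal sketch of that falsification test, assuming frame-level ground-truth and predicted gaze targets are available for the fresh videos. The five target classes are taken from the rebuttal below, and macro averaging is an assumption, since the averaging scheme behind the reported 0.829 is not stated here.

```python
# Generalisation check: score the unchanged pipeline on new classroom videos
# and compare against the 0.7 threshold named above.
from sklearn.metrics import f1_score

GAZE_TARGETS = ["laptop", "peer", "teacher", "whiteboard", "other"]  # classes assumed from the rebuttal

def generalisation_check(y_true, y_pred, threshold=0.7):
    """y_true, y_pred: frame-level gaze-target labels from the fresh video set."""
    score = f1_score(y_true, y_pred, labels=GAZE_TARGETS, average="macro")
    verdict = "does not generalise as claimed" if score < threshold else "consistent with the claim"
    return score, verdict
```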

read the original abstract

Previous studies have illustrated the potential of analysing gaze behaviours in collaborative learning to provide educationally meaningful information for students to reflect on their learning. Over the past decades, machine learning approaches have been developed to automatically detect gaze behaviours from video data. Yet, since these approaches often require large amounts of labelled data for training, human annotation remains necessary. Additionally, researchers have questioned the cross-configuration robustness of machine learning models developed, as training datasets often fail to encompass the full range of situations encountered in educational contexts. To address these challenges, this study proposes a scalable artificial intelligence approach that leverages pretrained and foundation models to automatically detect gaze behaviours in face-to-face collaborative learning contexts without requiring human-annotated data. The approach utilises pretrained YOLO11 for person tracking, YOLOE-26 with text-prompt capability for education-related object detection, and the Gaze-LLE model for gaze target prediction. The results indicate that the proposed approach achieves an F1-score of 0.829 in detecting students' gaze behaviours from video data, with strong performance for laptop-directed gaze and peer-directed gaze, yet weaker performance for other gaze targets. Furthermore, when compared to other supervised machine learning approaches, the proposed method demonstrates superior and more stable performance in complex contexts, highlighting its better cross-configuration robustness. The implications of this approach for supporting students' collaborative learning in real-world environments are also discussed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes a scalable AI pipeline that combines pretrained YOLO11 for person tracking, YOLOE-26 (text-prompt) for education-related object detection, and Gaze-LLE for gaze-target prediction to label students' gaze behaviors in face-to-face collaborative-learning videos without any human annotation or fine-tuning. It reports an overall F1-score of 0.829, with strong results for laptop- and peer-directed gaze and weaker results for other targets, and claims that the method outperforms and is more stable than supervised baselines in complex settings, thereby offering better cross-configuration robustness.

Significance. If the empirical claims are substantiated with full methodological detail, the work would be significant for educational technology: it would demonstrate a practical route to annotation-free gaze analysis at scale, removing a major barrier to studying collaborative learning in authentic classroom environments and potentially enabling larger, more diverse datasets for reflection tools.

major comments (3)
  1. [Abstract] Abstract: the headline F1-score of 0.829 is presented without any information on dataset size (number of videos, students, frames), how the five gaze targets were defined and annotated for ground truth, the identity or training details of the supervised baselines, or any error analysis. These omissions make it impossible to judge whether the reported performance and stability advantage are supported by the data.
  2. [Method] Method / Evaluation: the central claim of scalable, annotation-free performance rests on the untested assumption that YOLO11, YOLOE-26, and Gaze-LLE generalize zero-shot to multi-student classroom footage. No domain-gap quantification, failure-case analysis, or ablation on camera angle/lighting/occlusion differences is provided, which directly undermines both the F1 result and the cross-configuration robustness assertion.
  3. [Results] Results: the statement that performance is 'weaker for other gaze targets' is left unquantified and without per-class metrics or confusion-matrix details, preventing assessment of whether the overall 0.829 F1 is driven by a few dominant classes or truly reflects robust behavior across contexts.
minor comments (1)
  1. Define 'complex contexts' operationally (e.g., number of students, camera distance, lighting variability) so that the stability claim can be reproduced.
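One way to make that operational definition reproducible is to report the score stratified by recording configuration. A minimal sketch follows; the configuration fields (group size, camera distance, lighting) are the referee's illustrative examples, not quantities reported in the paper.

```python
# Per-configuration robustness report: macro F1 broken down by the
# configuration variables the referee suggests. Column names are illustrative.
import pandas as pd
from sklearn.metrics import f1_score

def f1_by_configuration(frames: pd.DataFrame) -> pd.DataFrame:
    """frames needs columns: y_true, y_pred, group_size, camera_distance, lighting."""
    rows = []
    for keys, grp in frames.groupby(["group_size", "camera_distance", "lighting"]):
        rows.append({"group_size": keys[0], "camera_distance": keys[1], "lighting": keys[2],
                     "n_frames": len(grp),
                     "macro_f1": f1_score(grp["y_true"], grp["y_pred"], average="macro")})
    # The spread of macro_f1 across rows is the operational "stability" figure.
    return pd.DataFrame(rows).sort_values("macro_f1")
```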

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that highlight opportunities to improve the clarity and completeness of our manuscript. We address each major comment point by point below, providing clarifications based on the existing work and indicating where revisions will strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline F1-score of 0.829 is presented without any information on dataset size (number of videos, students, frames), how the five gaze targets were defined and annotated for ground truth, the identity or training details of the supervised baselines, or any error analysis. These omissions make it impossible to judge whether the reported performance and stability advantage are supported by the data.

    Authors: We agree that the abstract would benefit from additional context. In the revised version, we will expand the abstract to specify the dataset composition (number of videos, students, and frames), briefly define the five gaze targets (laptop, peer, teacher, whiteboard, and other), note that ground-truth labels were obtained via human annotation solely for evaluation, identify the supervised baselines (including their architectures and training regimes), and reference the error analysis already present in the Results section. These details are fully elaborated in the Methods and Results sections. revision: yes

  2. Referee: [Method] Method / Evaluation: the central claim of scalable, annotation-free performance rests on the untested assumption that YOLO11, YOLOE-26, and Gaze-LLE generalize zero-shot to multi-student classroom footage. No domain-gap quantification, failure-case analysis, or ablation on camera angle/lighting/occlusion differences is provided, which directly undermines both the F1 result and the cross-configuration robustness assertion.

    Authors: The pipeline is explicitly zero-shot: the three models are applied in their pretrained form with no fine-tuning or target-domain labels. To address the concern, we will add a new subsection discussing observed domain gaps, including qualitative failure-case analysis (e.g., instances of heavy occlusion or extreme lighting) drawn from the collected classroom videos, and note the range of camera angles and configurations present in the dataset. While a controlled quantitative ablation would require new experiments beyond the current scope, the multi-configuration results already demonstrate stability; we will make this evidence more explicit and add a limitations paragraph on remaining generalization questions. revision: partial

  3. Referee: [Results] Results: the statement that performance is 'weaker for other gaze targets' is left unquantified and without per-class metrics or confusion-matrix details, preventing assessment of whether the overall 0.829 F1 is driven by a few dominant classes or truly reflects robust behavior across contexts.

    Authors: We accept this criticism and will include a table of per-class F1-scores together with a confusion matrix in the revised Results section. This will quantify the contribution of each gaze target to the aggregate score and allow direct evaluation of robustness across all classes. revision: yes
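The promised per-class table and confusion matrix can be produced directly from frame-level labels. A minimal sketch with scikit-learn, again assuming the five gaze targets listed above:

```python
# Per-target F1 plus confusion matrix, as promised in the rebuttal.
from sklearn.metrics import classification_report, confusion_matrix

GAZE_TARGETS = ["laptop", "peer", "teacher", "whiteboard", "other"]

def per_class_breakdown(y_true, y_pred):
    report = classification_report(y_true, y_pred, labels=GAZE_TARGETS, digits=3)
    cm = confusion_matrix(y_true, y_pred, labels=GAZE_TARGETS)  # rows: true target, cols: predicted
    return report, cm
```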

Circularity Check

0 steps flagged

No circularity: empirical zero-shot application of external pretrained models

full rationale

The paper reports an empirical pipeline that applies three externally pretrained models (YOLO11, YOLOE-26, Gaze-LLE) to classroom video and measures F1 on held-out clips. No equations, parameter fitting, or self-citation chains appear in the provided text. The 0.829 F1 and stability claims are direct evaluation outputs, not quantities forced by construction from the inputs. The zero-shot transfer assumption is a methodological premise whose validity is testable against external benchmarks and does not reduce the reported metrics to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that off-the-shelf pretrained models transfer directly to this educational video domain without adaptation or additional training data.

axioms (1)
  • domain assumption Pretrained computer vision models (YOLO11, YOLOE-26, Gaze-LLE) can be applied directly to face-to-face collaborative learning videos to accurately detect gaze targets without fine-tuning or labeled data.
    Invoked in the description of the proposed approach that relies on these models for tracking, object detection, and gaze prediction.

pith-pipeline@v0.9.0 · 5566 in / 1383 out tokens · 46399 ms · 2026-05-13T23:13:27.626814+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

  1. [1]

    Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative Learning Junyuan Liang1 [0000-0001-9119-2885], Qi Zhou1 [0000-0002-4694-4598], Sahan Bulathwela2 [0000-0002-5878-2143], and Mutlu Cukurova1 2 [0000-0001-5843-4854] 1 UCL Knowledge Lab, University College London, London, UK 2 UCL Center for Artificial Intell...

  2. [2]

    Computers & Education

    Cukurova, M., Luckin, R., Millán, E., Mavrikis, M.: The NISPI framework: Analysing collaborative problem-solving from students’ physical interactions. Computers & Education. 116, 93–109 (2018). https://doi.org/10.1016/j.compedu.2017.08.007

  3. [3]

    Computer Assisted Learning

    Spikol, D., Ruffaldi, E., Dabisias, G., Cukurova, M.: Supervised machine learning in multimodal learning analytics for estimating success in project-based learning. Computer Assisted Learning. 34, 366–377 (2018). https://doi.org/10.1111/jcal.12263

  4. [4]

    In: Responsive and Sustainable Educational Futures

    Zhou, Q., Bhattacharya, A., Suraworachet, W., Nagahara, H., Cukurova, M.: Automated Detection of Students’ Gaze Interactions in Collaborative Learning Videos: A Novel Approach. In: Viberg, O., Jivet, I., Muñoz-Merino, P.J., Perifanou, M., and Papathoma, T. (eds.) Responsive and Sustainable Educational Futures. pp. 504–517. Springer Nature Switzerland, C...

  5. [5]

    Educ Inf Technol

    Zhou, Q., Suraworachet, W., Cukurova, M.: Detecting non-verbal speech and gaze behaviours with multimodal data and computer vision to interpret effective collaborative learning interactions. Educ Inf Technol. 29, 1071–1098 (2024). https://doi.org/10.1007/s10639-023-12315-1

  6. [6]

    In: Artificial Intelligence in Education

    Zhou, Q., Suraworachet, W., Celiktutan, O., Cukurova, M.: What Does Shared Understanding in Students’ Face-to-Face Collaborative Learning Gaze Behaviours “Look Like”? In: Rodrigo, M.M., Matsuda, N., Cristea, A.I., and Dimitrova, V. (eds.) Artificial Intelligence in Education. pp. 588–593. Springer International Publishing, Cham (2022). https://doi.org/10...

  7. [7]

    Schneider, B., Pea, R.: Real-time mutual gaze perception enhances collaborative learning and collaboration quality. Intern. J. Comput.-Support. Collab. Learn. 8, 375–397 (2013). https://doi.org/10.1007/s11412-013-9181-4

  8. [8]

    In: Innovative Assessment of Collaboration

    Olsen, J.K., Aleven, V., Rummel, N.: Exploring Dual Eye Tracking as a Tool to Assess Collaboration. In: Von Davier, A.A., Zhu, M., and Kyllonen, P.C. (eds.) Innovative Assessment of Collaboration. pp. 157–172. Springer International Publishing, Cham (2017). https://doi.org/10.1007/978-3-319-33261-1_10

  9. [9]

    Liu, Q., Jiang, X., Jiang, R.: Classroom Behavior Recognition Using Computer Vision: A Systematic Review. Sensors. 25, 373 (2025). https://doi.org/10.3390/s25020373

  10. [10]

    Carroll, M., Ruble, M., Dranias, M., Rebensky, S., Chaparro, M., Chiang, J., Winslow, B.: Automatic Detection of Learner Engagement Using Machine Learning and Wearable Sensors. JBBS. 10, 165–178 (2020). https://doi.org/10.4236/jbbs.2020.103010

  11. [11]

    Schneider, B., Sharma, K., Cuendet, S., Zufferey, G., Dillenbourg, P., Pea, R.: Leveraging mobile eye-trackers to capture joint visual attention in co-located collaborative learning groups. Intern. J. Comput.-Support. Collab. Learn. 13, 241–261 (2018). https://doi.org/10.1007/s11412-018-9281-2

  12. [12]

    In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    Krafka, K., Khosla, A., Kellnhofer, P., Kannan, H., Bhandarkar, S., Matusik, W., Torralba, A.: Eye Tracking for Everyone. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2176–2184. IEEE, Las Vegas, NV, USA (2016). https://doi.org/10.1109/CVPR.2016.239

  13. [13]

    https://doi.org/10.48550/ARXIV.2508.15782

    Penchala, S., Kontham, S.R., Bhattacharjee, P., Mahmoodi, N., Fonseca, D., Karami, S., Ghahremani, M., Perkins, A.D., Rahimi, S., Golilarz, N.A.: Learning in Focus: Detecting Behavioral and Collaborative Engagement Using Vision Transformers, https://arxiv.org/abs/2508.15782, (2025). https://doi.org/10.48550/ARXIV.2508.15782

  14. [14]

    Machine Vision and Applications

    Huang, Q., Veeraraghavan, A., Sabharwal, A.: TabletGaze: dataset and analysis for unconstrained appearance-based gaze estimation in mobile tablets. Machine Vision and Applications. 28, 445–461 (2017). https://doi.org/10.1007/s00138-017-0852-4

  15. [15]

    In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chong, E., Wang, Y., Ruiz, N., Rehg, J.M.: Detecting Attended Visual Targets in Video. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5395–5405. IEEE, Seattle, WA, USA (2020). https://doi.org/10.1109/CVPR42600.2020.00544

  16. [16]

    https://doi.org/10.48550/ARXIV.2509.25164

    Sapkota, R., Cheppally, R.H., Sharda, A., Karkee, M.: YOLO26: Key Architectural Enhancements and Performance Benchmarking for Real-Time Object Detection, https://arxiv.org/abs/2509.25164, (2025). https://doi.org/10.48550/ARXIV.2509.25164

  17. [17]

    YOLOv11: An Overview of the Key Architectural Enhancements

    Khanam, R., Hussain, M.: YOLOv11: An Overview of the Key Architectural Enhancements, https://arxiv.org/abs/2410.17725, (2024). https://doi.org/10.48550/ARXIV.2410.17725

  18. [18]

    https://doi.org/10.48550/ARXIV.2412.09586

    Ryan, F., Bati, A., Lee, S., Bolya, D., Hoffman, J., Rehg, J.M.: Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders, https://arxiv.org/abs/2412.09586, (2024). https://doi.org/10.48550/ARXIV.2412.09586

  19. [19]

    The Journal of Machine Learning Research

    Fernandez-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research. 15, 3133–3181 (2014)

  20. [20]

    In: Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

    Yacouby, R., Axman, D.: Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models. In: Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems. pp. 79–91. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.eval4nlp-1.9

  21. [21]

    Rainio, O., Teuho, J., Klén, R.: Evaluation metrics and statistical tests for machine learning. Sci Rep. 14, 6086 (2024). https://doi.org/10.1038/s41598-024-56706-x

  22. [22]

    In: Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications

    Müller, P., Huang, M.X., Zhang, X., Bulling, A.: Robust eye contact detection in natural multi-person interactions using gaze and speaking behaviour. In: Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications. pp. 1–10. ACM, Warsaw, Poland (2018). https://doi.org/10...

  23. [23]

    https://doi.org/10.1007/978-3-031-26316-3_4

    13844, 52–70 (2023). https://doi.org/10.1007/978-3-031-26316-3_4

  24. [24]

    IEEE Trans. Neural Netw.

    Stiefelhagen, R., Jie Yang, Waibel, A.: Modeling focus of attention for meeting indexing based on multiple cues. IEEE Trans. Neural Netw. 13, 928–938 (2002). https://doi.org/10.1109/TNN.2002.1021893

  25. [25]

    Applied Sciences

    Rodríguez-Ortiz, M.Á., Santana-Mancilla, P.C., Anido-Rifón, L.E.: Machine Learning and Generative AI in Learning Analytics for Higher Education: A Systematic Review of Models, Trends, and Challenges. Applied Sciences. 15, 8679 (2025). https://doi.org/10.3390/app15158679

  26. [26]

    Computers and Education Open

    Mathrani, A., Susnjak, T., Ramaswami, G., Barczak, A.: Perspectives on the challenges of generalizability, transparency and ethics in predictive learning analytics. Computers and Education Open. 2, 100060 (2021). https://doi.org/10.1016/j.caeo.2021.100060

  27. [27]

    In: LAK22: 12th International Learning Analytics and Knowledge Conference

    Yan, L., Zhao, L., Gasevic, D., Martinez-Maldonado, R.: Scalability, Sustainability, and Ethicality of Multimodal Learning Analytics. In: LAK22: 12th International Learning Analytics and Knowledge Conference. pp. 13–23. ACM, Online USA (2022). https://doi.org/10.1145/3506860.3506862

  28. [28]

    Computer Assisted Learning

    Suraworachet, W., Zhou, Q., Cukurova, M.: University Students’ Perceptions of a Multimodal AI System for Real-World Collaboration Analytics: Lessons Learned from a Case Study. Computer Assisted Learning. 41, e70103 (2025). https://doi.org/10.1111/jcal.70103

  29. [29]

    Learning Analytics

    Whitehead, R., Nguyen, A., Järvelä, S.: Utilizing Multimodal Large Language Models for Video Analysis of Posture in Studying Collaborative Learning: A Case Study. Learning Analytics. 12, 186–200 (2025). https://doi.org/10.18608/jla.2025.8595