Cross-modal Affinity-aligned Multimodal Learning Analytics for Predicting Student Collaboration Satisfaction in Game-Based Learning

Chia-Ming Lee; Wen-Hsin Tsai; Yuk-Ying Tung

arxiv: 2605.16806 · v1 · pith:SWD46MLNnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI · cs.CV

Pith reviewed 2026-05-19 21:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV

keywords: multimodal learning analytics, game-based learning, collaboration satisfaction, cross-modal affinity, contrastive learning, modality degradation, educational data fusion, student collaboration

The pith

A module using affinity matrices and contrastive learning to align and selectively suppress unreliable data sources improves predictions of student collaboration satisfaction in game-based learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Predicting how satisfied students feel about working together in small-group educational games is difficult because data from cameras and logs often varies in quality across participants. The paper introduces the AAMLA framework to combine facial action units, head pose, eye gaze, and interaction logs while addressing cases where one or more sources become uninformative. Its central CAMA module builds affinity matrices that relate the different feature types after they are projected into a common space and then applies contrastive learning to enforce consistency across modalities. This setup lets the model downweight problematic inputs without discarding them entirely, producing more stable results than single-modality models or earlier attention-based fusion techniques on data from fifty middle-school students.

Core claim

The Cross-modal Affinity-guided Modality Alignment (CAMA) module explicitly models inter-modal relationships via affinity matrices and enforces cross-modal consistency through contrastive learning, enabling adaptive suppression of uninformative modalities without discarding them and yielding consistent improvements over unimodal baselines and prior cross-attention approaches.
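To make the claim concrete, here is one form such an objective could take. It is a hedged sketch, not the paper's formulation, which this page does not reproduce. It assumes cosine-similarity affinities and a supervised InfoNCE-style loss. The symbols $z_i^{(m)}$ (projected embedding of sample $i$ in modality $m$), $\tau$ (temperature), $y_i$ (satisfaction label), and the positive-pair set $\mathcal{P}$ are introduced here for illustration only; the loss name follows the Figure 1 caption.

```latex
% Assumed affinity between modality m of sample i and modality n of sample j
A^{(m,n)}_{ij} = \frac{z_i^{(m)\top} z_j^{(n)}}
                      {\lVert z_i^{(m)} \rVert \, \lVert z_j^{(n)} \rVert}

% Assumed supervised contrastive objective over the positive pairs P
\mathcal{L}_{\mathrm{aff}} = -\frac{1}{|\mathcal{P}|}
  \sum_{((i,m),(j,n)) \in \mathcal{P}}
  \log \frac{\exp\!\left(A^{(m,n)}_{ij} / \tau\right)}
            {\sum_{(k,l) \neq (i,m)} \exp\!\left(A^{(m,l)}_{ik} / \tau\right)},
\qquad
\mathcal{P} = \{((i,m),(j,n)) : y_i = y_j,\ (i,m) \neq (j,n)\}
```

Under this assumed form, an embedding from an uninformative modality tends to have low affinity with same-class embeddings from the other modalities, so it contributes weakly to the aligned representation without being removed from the input.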

What carries the argument

The Cross-modal Affinity-guided Modality Alignment (CAMA) module, which constructs affinity matrices from projected heterogeneous features and applies contrastive learning to enforce cross-modal consistency for adaptive suppression of uninformative inputs.
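The mechanism described above can be illustrated with a small sketch. It assumes the per-modality embeddings have already been projected into a shared space, uses cosine similarity as the affinity, and turns each modality's mean affinity to the others into a softmax weight. The function names, the temperature value, and the three-dimensional toy vectors are hypothetical; the paper's actual affinity definition and weighting rule are not given on this page.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def affinity_matrix(embeddings):
    """Pairwise cosine similarity between per-modality embeddings (assumed affinity)."""
    unit = [l2_normalize(e) for e in embeddings]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

def modality_weights(affinity, temperature=0.5):
    """Softmax over each modality's mean affinity to the other modalities.

    A modality that disagrees with the rest gets a small weight but is never
    dropped, which mirrors the 'suppress without discarding' idea.
    """
    m = len(affinity)
    scores = [
        sum(affinity[i][j] for j in range(m) if j != i) / (m - 1)
        for i in range(m)
    ]
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: AU, pose, and trace roughly agree; gaze points elsewhere.
names = ["AU", "pose", "gaze", "trace"]
embeddings = [
    [1.0, 0.1, 0.0],   # AU
    [0.9, 0.2, 0.1],   # pose
    [-0.2, 0.1, 1.0],  # gaze (deliberately misaligned)
    [1.0, 0.0, 0.1],   # trace
]
affinity = affinity_matrix(embeddings)
weights = modality_weights(affinity)
for name, weight in zip(names, weights):
    print(f"{name}: {weight:.3f}")
```

In this toy setup the gaze embedding receives the smallest weight because it has the lowest mean affinity to the other three modalities.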

If this is right

Higher prediction accuracy for student collaboration satisfaction than unimodal baselines or prior cross-attention methods on the fifty-student dataset.
More stable performance when individual modalities such as eye gaze exhibit inconsistent informativeness across participants.
Generation of robust cross-modal representations that remain interpretable under SHAP and t-SNE inspection.
Retention of all input modalities while adaptively reducing the influence of unhelpful ones rather than requiring explicit removal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alignment technique could be applied to other classroom sensor streams where data quality fluctuates with student behavior or equipment condition.
Embedding the module in live game environments might enable automatic prompts that help groups whose collaboration signals have weakened.
Scaling the approach to larger or more diverse student populations would test whether the affinity construction generalizes beyond the current middle-school sample.
Systems built this way could lower the barrier to deploying multimodal analytics in ordinary classrooms by reducing reliance on perfectly functioning sensors.

Load-bearing premise

That affinity matrices derived from the heterogeneous features together with contrastive learning will reliably capture and mitigate modality degradation on this dataset without introducing artifacts that affect the downstream satisfaction prediction.

What would settle it

If ablation experiments on the same fifty-student EcoJourneys data show that disabling the affinity matrices or the contrastive loss produces no drop in performance under controlled modality degradation, the claimed benefit of the alignment mechanism would be refuted.
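A controlled degradation of the kind this test relies on could be simulated roughly as follows. This is a sketch under stated assumptions: the paper's own degradation procedure is not described on this page, so replacing a modality's features with Gaussian noise for a fraction of samples is a stand-in, and `degrade_modality`, `rate`, and `noise_std` are hypothetical names.

```python
import random

def degrade_modality(samples, modality, rate, rng, noise_std=1.0):
    """Return a copy of `samples` where `modality` is replaced by Gaussian noise
    for roughly a fraction `rate` of the samples.

    Each sample is a dict mapping modality name -> feature list. The input is
    not modified, so clean and degraded runs can be compared side by side.
    """
    degraded = []
    for sample in samples:
        new_sample = {name: list(vec) for name, vec in sample.items()}
        if rng.random() < rate:
            new_sample[modality] = [
                rng.gauss(0.0, noise_std) for _ in sample[modality]
            ]
        degraded.append(new_sample)
    return degraded

# Hypothetical usage: corrupt gaze features for about 30% of samples.
rng = random.Random(0)
clean = [{"gaze": [0.2, 0.4], "trace": [1.0, 0.0, 3.0]} for _ in range(10)]
noisy = degrade_modality(clean, "gaze", rate=0.3, rng=rng)
changed = sum(1 for a, b in zip(clean, noisy) if a["gaze"] != b["gaze"])
print(f"{changed} of {len(clean)} samples had gaze replaced")
```

Running the full model and each ablated variant on the same degraded copies keeps the comparison controlled, since every variant then sees identical corrupted inputs.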

Figures

Figures reproduced from arXiv: 2605.16806 by Chia-Ming Lee, Wen-Hsin Tsai, Yuk-Ying Tung.

**Figure 1.** Overview of the proposed AAMLA framework. Four modality streams (facial action units, head pose, eye gaze, trace logs) are encoded by modality-specific encoders and projected into a unified d = 128 semantic space. The CAMA module explicitly models inter-modal relationships via affinity matrices and contrastive loss L_aff, suppressing uninformative modalities. Aligned embeddings are classified by an FC head …

**Figure 2.** Pipeline of the proposed CAMA strategy. Different shapes denote different modalities; color denotes satisfaction class (red: high; blue: low). CAMA pulls same-class modality embeddings together and pushes apart different-class embeddings via affinity matrices, transforming scattered unaligned features (left) into compact, semantically coherent clusters (right) robust to uninformative modalities such as g…

**Figure 3.** Student communication while playing in the EcoJourneys collaborative learning environment [1]. Students work in small groups to investigate a fish illness on a virtual Philippine island, generating rich multimodal behavioral signals (facial expressions, head pose, eye gaze, and in-game chat interactions) that our AAMLA framework leverages for collaboration satisfaction prediction. Unlike prior …

**Figure 4.** t-SNE visualizations of multimodal feature distributions under different ablation settings. Color denotes satisfaction class (Sat-1 to Sat-4); marker shape denotes modality (AU, Pose, Gaze, Trace). (a) The full AAMLA model produces tightly clustered, semantically aligned features with clear inter-class separation. (b) Removing CAMA causes cross-modality drift and partial overlap between satisfaction class…

**Figure 5.** Evolution of affinity scores among the AU, pose, gaze, and trace modalities across {high-satisfaction, low-satisfaction} and {original, degraded} conditions during training. High-satisfaction activities achieve stable convergence earlier, reflecting more consistent cross-modal alignment, while degraded conditions exhibit higher variance, particularly for gaze features, motivating the explicit alignment enforced …

**Figure 6.** SHAP beeswarm plot of feature contributions. Color denotes normalized feature value. Trace features rank highest, AU and pose features show moderate contributions, and gaze features are absent from the top 20, corroborating CAMA’s adaptive suppression of uninformative modalities.

Original abstract

Collaborative game-based learning environments offer rich opportunities for small-group knowledge construction, yet automatically predicting student collaboration satisfaction remains challenging. A critical barrier is modality degradation: in educational deployments, individual modalities such as eye gaze exhibit inconsistent informativeness across student cohorts, causing implicit attention-based fusion to produce brittle multimodal representations. We propose the Affinity-Aligned Multimodal Learning Analytics (AAMLA) framework, whose core contribution is the Cross-modal Affinity-guided Modality Alignment (CAMA) module, which explicitly models inter-modal relationships via affinity matrices and enforces cross-modal consistency through contrastive learning, enabling adaptive suppression of uninformative modalities without discarding them. AAMLA further applies modality-specific projection layers to map heterogeneous features, including facial action units, head pose, eye gaze, and interaction trace logs, into a unified semantic space prior to alignment. Experiments on 50 middle school students in the EcoJourneys collaborative learning environment demonstrate consistent improvements over unimodal baselines and prior cross-attention approaches under standard and modality degradation conditions, with SHAP and t-SNE analyses confirming that CAMA produces robust, interpretable cross-modal representations for student collaboration modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The CAMA module pairs affinity matrices with contrastive learning to handle modality degradation in small-scale student collaboration data, but with only n=50 students it is hard to tell whether the gains are robust or cohort-specific.

The paper introduces the CAMA module inside an AAMLA framework. It builds affinity matrices across facial action units, head pose, eye gaze, and interaction logs, then applies contrastive learning to keep the aligned features consistent. The goal is to let the model down-weight uninformative modalities on the fly instead of dropping them outright, which is meant to improve predictions of collaboration satisfaction in a game-based setting like EcoJourneys.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Affinity-Aligned Multimodal Learning Analytics (AAMLA) framework whose core is the Cross-modal Affinity-guided Modality Alignment (CAMA) module. CAMA constructs affinity matrices from heterogeneous features (facial action units, head pose, eye gaze, interaction logs) and applies contrastive learning to enforce cross-modal consistency, allowing adaptive suppression of uninformative modalities. Modality-specific projection layers map features into a shared space. On data from 50 middle-school students in the EcoJourneys environment, the approach reportedly yields consistent gains over unimodal baselines and prior cross-attention methods under both standard and modality-degradation conditions, with supporting SHAP and t-SNE interpretability analyses.

Significance. If the improvements are reproducible and generalizable, the work addresses a practically relevant problem in educational multimodal learning analytics by providing a mechanism for robust fusion without explicit modality dropping. Credit is due for the explicit use of affinity matrices plus contrastive objectives and for including interpretability analyses (SHAP, t-SNE). However, the small cohort size and absence of detailed quantitative results constrain the strength of the contribution.

major comments (2)

[Experiments] Experiments section (and abstract): the central claim of 'consistent improvements' is asserted without any reported numerical metrics (accuracy, F1, correlation, AUC), error bars, ablation tables, or statistical significance tests. This absence prevents assessment of effect size and leaves the evidence for CAMA's superiority at a high-level assertion only.
[Dataset and Evaluation] Dataset and Evaluation: with n=50 students from a single EcoJourneys deployment, student-wise cross-validation alone does not rule out overfitting to cohort-specific correlations or to the particular degradation simulation used. No external validation cohort or larger-scale replication is described, which directly affects the load-bearing claim that the affinity-plus-contrastive mechanism produces transferable modality reliability.
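The student-wise cross-validation mentioned in the second comment can be sketched as a grouped split in which all samples from one student land in the same fold. This is a minimal illustration, not the paper's evaluation code: `student_wise_folds`, the fold count, and the toy student IDs are assumptions.

```python
import random

def student_wise_folds(student_ids, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs such that no student
    contributes samples to both sides of the same split.

    `student_ids[i]` is the student who produced sample i.
    """
    unique_students = sorted(set(student_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_students)
    # Assign whole students (not individual samples) to folds, round-robin.
    fold_of = {s: i % n_folds for i, s in enumerate(unique_students)}
    for k in range(n_folds):
        test = [i for i, s in enumerate(student_ids) if fold_of[s] == k]
        train = [i for i, s in enumerate(student_ids) if fold_of[s] != k]
        yield train, test

# Toy example: 8 samples from 5 students, split into 3 folds.
ids = ["s1", "s1", "s2", "s3", "s3", "s3", "s4", "s5"]
for train_idx, test_idx in student_wise_folds(ids, n_folds=3):
    held_out = sorted({ids[i] for i in test_idx})
    print(f"held-out students: {held_out}, test samples: {len(test_idx)}")
```

A split like this prevents leakage of a student's own samples into training, but, as the comment notes, it cannot reveal correlations shared by the whole cohort; only an external cohort can do that.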

minor comments (2)

[Method] Clarify the precise mathematical definition of the affinity matrices (e.g., how they are computed from the four feature streams) and the exact contrastive loss formulation used in CAMA.
[Method] Provide the dimensions of the modality-specific projection layers and the hyperparameter values for the contrastive loss; these are listed as free parameters but not reported.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and for recognizing the relevance of robust multimodal fusion in educational settings. We address each major comment below, making revisions where feasible while being transparent about study limitations.

Referee: [Experiments] Experiments section (and abstract): the central claim of 'consistent improvements' is asserted without any reported numerical metrics (accuracy, F1, correlation, AUC), error bars, ablation tables, or statistical significance tests. This absence prevents assessment of effect size and leaves the evidence for CAMA's superiority at a high-level assertion only.

Authors: We agree that explicit quantitative reporting is essential to allow proper evaluation of effect sizes and statistical reliability. Although the full manuscript contains results tables in Section 4, these were not sufficiently foregrounded. In the revision we have expanded the Experiments section with a dedicated results table reporting accuracy, F1, AUC, and correlation values (with standard deviations across student-wise folds), added ablation tables comparing CAMA against cross-attention baselines, and included paired t-test p-values. The abstract has also been updated to cite the key numerical gains under both standard and degradation conditions. revision: yes
Referee: [Dataset and Evaluation] Dataset and Evaluation: with n=50 students from a single EcoJourneys deployment, student-wise cross-validation alone does not rule out overfitting to cohort-specific correlations or to the particular degradation simulation used. No external validation cohort or larger-scale replication is described, which directly affects the load-bearing claim that the affinity-plus-contrastive mechanism produces transferable modality reliability.

Authors: This is a valid concern. The modest cohort size and single-environment source limit strong claims of broad transferability, even with student-wise cross-validation. We have added an explicit Limitations subsection that discusses potential cohort-specific correlations, the simulated nature of modality degradation, and the consequent scope of our generalizability claims. We have also clarified the design rationale for the affinity matrices and contrastive objective in promoting robustness. However, we cannot introduce an external validation cohort or larger replication within the current revision, as that would require new data collection. revision: partial
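The paired t-test promised in the first response compares two models on the same folds. A minimal sketch of the statistic follows; the fold scores are made-up numbers for illustration, not results from the paper, and `paired_t_statistic` is a hypothetical helper. It returns only the t statistic and degrees of freedom, leaving the p-value lookup to a statistics table or library.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for per-fold scores of two models on identical folds.

    Returns (t, degrees_of_freedom). Raises ValueError if the inputs differ in
    length, have fewer than two folds, or the differences have zero variance.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists with >= 2 folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("zero variance in differences; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

# Hypothetical per-fold accuracies (illustrative values only).
with_cama = [0.70, 0.72, 0.68, 0.75, 0.71]
cross_attention = [0.66, 0.70, 0.65, 0.71, 0.69]
t_value, dof = paired_t_statistic(with_cama, cross_attention)
print(f"t = {t_value:.3f} with {dof} degrees of freedom")
```

With five folds the test has only four degrees of freedom, and fold scores from cross-validation share training data, so a small p-value here should be read as suggestive rather than conclusive.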

standing simulated objections not resolved

Absence of an external validation cohort or larger-scale replication, which cannot be addressed without new data collection

Circularity Check

0 steps flagged

No circularity: method uses standard contrastive alignment on projected features with empirical validation

The paper proposes the AAMLA framework centered on the CAMA module, which computes affinity matrices from heterogeneous features (facial action units, head pose, eye gaze, interaction logs) after modality-specific projections and applies contrastive learning for cross-modal consistency. No equations, derivations, or self-citations are shown that reduce the claimed adaptive suppression or prediction improvements to fitted parameters or inputs by construction. The approach relies on standard contrastive objectives evaluated via experiments on the EcoJourneys dataset of 50 students, with comparisons to unimodal and cross-attention baselines, making the central claims empirically grounded and self-contained rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on standard machine-learning assumptions about feature projection and contrastive objectives plus one new module whose effectiveness is demonstrated only on the internal dataset.

free parameters (1)

Projection layer dimensions and contrastive loss hyperparameters
Required to map heterogeneous modalities into unified space and train the alignment; values not reported in abstract.

axioms (1)

domain assumption: Heterogeneous multimodal features can be mapped into a single semantic space where affinity relationships are meaningful.
Invoked when describing modality-specific projection layers prior to CAMA.

invented entities (1)

CAMA module (no independent evidence)
purpose: To compute affinity matrices and enforce cross-modal consistency via contrastive learning for adaptive modality weighting.
Newly proposed component whose independent validation outside this study is not provided.

pith-pipeline@v0.9.0 · 5740 in / 1427 out tokens · 48568 ms · 2026-05-19T21:09:53.457556+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages

[1] Halim Acosta, Seung Lee, Bradford Mott, Haesol Bae, Krista Glazewski, Cindy Hmelo-Silver, and James C. Lester. Multimodal learning analytics for predicting student collaboration satisfaction in collaborative game-based learning. In Proceedings of the 17th International Conference on Educational Data Mining, pages 224–235. International Educational D...
[2] Waleed Mugahed Al-Rahmi and Mohd Shahizan Othman. Evaluating student’s satisfaction of using social media through collaborative learning in higher education. International Journal of Advances in Engineering & Technology, 6(4):15–41, 2013.
[3] Youngkyun Baek and Ahmed Touati. Comparing collaborative and cooperative gameplay for academic and gaming achievements. Journal of Educational Computing Research, 57(8):2110–2140, 2020.
[4] Tadas Baltrušaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. Openface 2.0: Facial behavior analysis toolkit. In 2018 IEEE International Conference on Automatic Face and Gesture Recognition, pages 59–66. IEEE,
[5] Nuno Borges, Leif Lindblom, Benjamin Clarke, Anna Gander, and Rodney Lowe. Classifying confusion: Autodetection of communicative misunderstandings using facial action units. In 2019 Affective Computing and Intelligent Interaction Workshops and Demos, pages 401–406, 2019.
[6] Matthew Bradford, Imene Khebour, Nathan Blanchard, and Nirmalya Krishnaswamy. Automatic detection of collaborative states in small groups using multimodal features. In Proceedings of the 24th International Conference on Artificial Intelligence in Education, pages 767–773, 2023.
[7] Mengxue Cai and Clayton D. Epp. Modeling cognitive load and affect to support adaptive online learning. In Proceedings of the 15th International Conference on Educational Data Mining, pages 799–804, 2022.
[8] Dustin Carpenter, Andrew Emerson, Bradford W. Mott, Abeer Saleh, Krista D. Glazewski, Cindy E. Hmelo-Silver, and James C. Lester. Detecting off-task behavior from student dialogue in game-based collaborative learning, pages 55–66. Springer, 2020.
[9] Pankaj Chejara, Luis P. Prieto, María J. Rodríguez-Triana, Ángel Ruiz-Calleja, Riin Kasepalu, Ioanna-Angeliki Chounta, and Barbara Schneider. Exploring indicators for collaboration quality and its dimensions in classroom settings using multimodal learning analytics. In European Conference on Technology Enhanced Learning, pages 60–74. Springer, 2023.
[10] Ting Chen, Calvin Luo, and Lala Li. Intriguing properties of contrastive losses. In NeurIPS, 2021.
[11] Xiaoqing Chen, Di Zou, Haoran Xie, Gong Cheng, and Fanyu Su. A bibliometric analysis of game-based collaborative learning between 2000 and 2019. International Journal of Mobile Learning and Organisation, 16(1):20–51, 2022.
[12] Imen Daoudi, Emmanuel Tranvouez, Rihab Chebil, Bernard Espinasse, and Wajdi L. Chaari. An EDM-based multimodal method for assessing learners’ affective states in collaborative crisis management serious games. In Proceedings of the 13th International Conference on Educational Data Mining,
[13] Yael Dich, Jennifer M. Reilly, and Barbara Schneider. Using physiological synchrony as an indicator of collaboration quality, task performance and learning. In Artificial Intelligence in Education: 19th International Conference, AIED 2018, pages 98–110. Springer, 2018.
[14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
[15] Iván Fonseca, Manuel Caviedes, Juan Chantré, and Jaime Bernate. Gamification and game-based learning as cooperative learning tools: A systematic review. International Journal of Emerging Technologies in Learning (iJET), 18(21):4–23, 2023.
[16] Jingsheng Gao, Jiacheng Ruan, Suncheng Xiang, Zefang Yu, Ke Ji, Mingye Xie, Ting Liu, and Yuzhuo Fu. Lamm: Label alignment for multi-modal prompt learning. arXiv preprint arXiv:2312.08212, 2023.
[17] Amy E. Griffith, Grace A. Katuka, Joseph B. Wiggins, Kristy E. Boyer, Jason Freeman, Brian Magerko, and Taylor McKlin. Investigating the relationship between dialogue states and partner satisfaction during co-creative learning tasks. International Journal of Artificial Intelligence in Education, 33(3):543–582, 2023.
[18] Zhongyang Guo and Reza Barmaki. Deep neural networks for collaborative learning analytics: Evaluating team collaborations using student gaze point prediction. Australasian Journal of Educational Technology, 36(6):53–71, 2020.
[19] Celeste B. Harris, Penny Van Bergen, Samantha A. Harris, Natalie McIlwain, and Aline Arguel. Here’s looking at you: eye gaze and collaborative recall. Psychological Research, 86:769–779, 2022.
[20] Kenan Hava, Tolga Guyer, and Hasan Cakir. Gifted students’ learning experiences in systematic game development process in after-school activities. Educational Technology Research and Development, 68:1439–1459, 2020.
[21] Yufeng Huang, Jiji Tang, Zhuo Chen, Rongsheng Zhang, Xinfeng Zhang, Weijie Chen, Zeng Zhao, Zhou Zhao, Tangjie Lv, Zhipeng Hu, and Wen Zhang. Structure-clip: Towards scene graph knowledge to enhance multi-modal structured representations. In AAAI, 2024.
[22] Ching-Yi Lai, Chih-Yu Jian, Pei-Cheng Chuang, Chia-Ming Lee, Chih-Chung Hsu, Chiou-Ting Hsu, and Chia-Wen Lin. Umcl: Unimodal-generated multimodal contrastive learning for cross-compression-rate deepfake detection. Int. J. Comput. Vis., 134:40, 2026.
[23] Binh M. Le and Simon S. Woo. Quality-agnostic deepfake detection with intra-model collaborative learning. In ICCV, pages 22321–22332, 2023.
[24] Jiyoung Lee, Soo-Whan Chung, Sunok Kim, Hong-Goo Kang, and Kwanghoon Sohn. Looking into your speech: Learning cross-modal affinity for audio-visual speech separation. In CVPR, 2021.
[25] Soo Jeoung Lee, Siva Srinivasan, Tracy Trail, David Lewis, and Suzanne Lopez. Examining the relationship among student perception of support, course satisfaction, and learning outcomes in online learning. The Internet and Higher Education, 14(3):158–163, 2011.
[26] Jie Li, Yu Lin, Ming Sun, and Rustam Shadiev. Socially shared regulation of learning in game-based collaborative learning environments promotes algorithmic thinking, learning participation and positive learning attitudes. Interactive Learning Environments, 31(3):1715–1726, 2023.
[27] Peizhong Li, Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Rahul Jain, Varun Manjunatha, and Huijuan Liu. Selfdoc: Self-supervised document representation learning. In CVPR, pages 5652–5660, 2021.
[28] Hsin-Yu Liang, Tsung-Yen Hsu, Gwo-Jen Hwang, Shih-Chun Chang, and Hsiao-Chen Chu. A mandatory contribution-based collaborative gaming approach to enhancing students’ collaborative learning outcomes in science museums. Interactive Learning Environments, 31(5):2692–2706, 2023.
[29] Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In NeurIPS, 2022.
[30] Mengmeng Ma, Jian Ren, Long Zhao, Sergey Tulyakov, Cathy Wu, and Xi Peng. Smil: Multimodal learning with severely missing modality. In AAAI, pages 2302–2310, 2021.
[31] Yuxin Ma, Mehmet Celepkolu, and Kristy Elizabeth Boyer. Detecting impasse during collaborative problem solving with multimodal learning analytics. In LAK22: 12th International Learning Analytics and Knowledge Conference, pages 45–55, 2022.
[32] Yuxin Ma, Grace A. Katuka, Mehmet Celepkolu, and Kristy Elizabeth Boyer. Investigating multimodal predictors of peer satisfaction for collaborative coding in middle school. In Proceedings of the 15th International Conference on Educational Data Mining. International Educational Data Mining Society, 2022.
[33] Bradley McDaniel, Sidney D’Mello, Brent King, Patrick Chipman, Kristopher Tapp, and Arthur Graesser. Facial features for affective state detection in learning environments. In Proceedings of the Annual Meeting of the Cognitive Science Society, 2007.
[34] Jennifer K. Olsen, Kshitij Sharma, Nikol Rummel, and Vincent Aleven. Temporal analysis of multimodal data to predict collaborative learning outcomes. British Journal of Educational Technology, 51(5):1527–1547, 2020.
[35] Satyapriya Praharaj, Maren Scheffel, Martin Schmitz, Marcus Specht, and Hendrik Drachsler. Towards collaborative convergence: quantifying collaboration quality with automated co-located collaboration analytics. In LAK22: 12th International Learning Analytics and Knowledge Conference, pages 358–369, 2022.
[36] Roberto U. Puga. Game-based learning: a tool that enhances the collaborative work. In European Conference on Games Based Learning, pages 570–577, 2022.
[37] Barbara Schneider and Roy Pea. Toward collaboration sensing. International Journal of Computer-Supported Collaborative Learning, 9:371–395, 2014.
[38] Kshitij Sharma, Ioannis Leftheriotis, and Michail Giannakos. Utilizing interactive surfaces to enhance learning, collaboration and engagement: Insights from learners’ gaze and speech. Sensors, 20(7):1964, 2020.
[39] Hyo-Jeong So and Thomas A. Brush. Student perceptions of collaborative learning, social presence and satisfaction in a blended learning environment: Relationships and critical factors. Computers & Education, 51(1):318–336, 2008.
[40] Emily L. Starr, Jennifer M. Reilly, and Barbara Schneider. Toward using multi-modal learning analytics to support and measure collaboration in co-located dyads. In ICLS 2018: 13th International Conference of the Learning Sciences, pages 448–455. International Society of the Learning Sciences, 2018.
[41] Andrew E. Stewart, Zachary Keirn, and Sidney K. D’Mello. Multimodal modeling of collaborative problem-solving facets in triads. User Modeling and User-Adapted Interaction, 31(4):713–751, 2021.
[42] Özgür Sümer, Paul Goldberg, Sidney D’Mello, Peter Gerjets, Ulrich Trautwein, and Enkelejda Kasneci. Multimodal engagement analysis from facial videos in the classroom. IEEE Transactions on Affective Computing, 14(2):1012–1027, 2021.
[43] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
[44] Chao Wang and Lijuan Huang. A systematic review of serious games for collaborative learning: Theoretical framework, game mechanic and efficiency assessment. International Journal of Emerging Technologies in Learning, 16(6):88–105, 2021.
[45] Hu Wang, Yuanhong Chen, Congbo Ma, Jodie Avery, Louise Hull, and Gustavo Carneiro. Multi-modal learning with missing modality via shared-specific feature modelling. In CVPR, pages 15878–15887, 2023.
[46] Zehan Wang, Yang Zhao, Xize Cheng, Haifeng Huang, Jiageng Liu, Aoxiong Yin, Li Tang, Linjun Li, Yongqi Wang, Ziang Zhang, and Zhou Zhao. Connecting multi-modal contrastive representations. In NeurIPS, 2023.
[47] Yi Xin, Junlong Du, Qiang Wang, Ke Yan, and Shouhong Ding. Mmap: multi-modal alignment prompt for cross-domain multi-task learning. In AAAI, 2024.
[48] Mouxing Yang, Yunfan Li, Changqing Zhang, Peng Hu, and Xi Peng. Test-time adaptation against multi-modal reliability bias. In ICLR, 2024.
[49]

Zhonggen Yu, Ming Gao, and Lili Wang. The effect of educational games on learning outcomes, student motivation, engagement and satisfaction. Journal of Educational Computing Research, 59(3):522–546, 2021.
[50]

Abdulazeez Abubakar Yunusa and Ibraheem Nasirudeen Umar. A scoping review of critical predictive factors (CPFs) of satisfaction and perceived learning outcomes in e-learning environments. Education and Information Technologies, 26:1223–1270, 2021.
[51]

Johanna Zambrano, Paul A. Kirschner, and Femke Kirschner. How cognitive load theory can be applied to collaborative learning. In Advances in Cognitive Load Theory: Rethinking Teaching, pages 30–40, 2019.
[52]

Chang Zhu. Student satisfaction, performance, and knowledge construction in online collaborative learning. Journal of Educational Technology & Society, 15(1):127–136, 2012.
