arxiv: 2604.15336 · v1 · submitted 2026-03-10 · 💻 cs.HC · cs.AI

Recognition: no theorem link

Facial-Expression-Aware Prompting for Empathetic LLM Tutoring

Shuangquan Feng , Laura Fleig , Ruisen Tu , Philip Chi , Edmund Bu , Melinda Ozel , Junhua Ma , Teng Fei

show 1 more author

Virginia R. de Sa

Authors on Pith no claims yet

Pith reviewed 2026-05-15 13:35 UTC · model grok-4.3

classification 💻 cs.HC cs.AI

keywords facial expressionsLLM tutoringempathetic responsesaction unitsprompt engineeringmultimodal interactioneducational AI

0 comments

The pith

Conditioning LLM prompts on facial action units improves empathetic responsiveness in tutoring across multiple model backbones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether adding facial expression signals to LLM prompts can make tutoring agents more sensitive to learners' emotional states without any model retraining. It creates a simulated tutoring setup where a student agent displays facial behaviors taken from a large unlabeled video dataset, then tests four variants: a text-only baseline, a multimodal version with a random frame, and two action-unit-based methods that either describe the expressions in text or select a peak frame for visual input. Across 960 multi-turn conversations on three different LLM backbones, the action-unit approaches produce higher empathetic responses to facial cues while preserving clarity on pedagogical content and textual signals. Human and AI raters show strongest agreement on the empathy dimension, indicating the gains are measurable and consistent.

Core claim

The central claim is that action unit estimation models enable lightweight prompt-level integration of facial expressions, either as textual descriptions or as selected peak frames, which consistently raises empathetic responsiveness to student facial cues in LLM tutoring agents; this holds across GPT-5.1, Claude Ops 4.5, and Gemini 2.5 Pro backbones, beats both text-only and random-frame baselines, and leaves pedagogical clarity and text responsiveness unchanged.

What carries the argument

Action Unit estimation model (AUM) that either converts detected facial movements into textual descriptions for prompt injection or selects the highest-intensity expression frame for visual grounding.

If this is right

Textual AU descriptions and peak-frame selection each show advantages that vary by the specific LLM backbone.
Gains in facial empathy occur without measurable loss in pedagogical clarity or responsiveness to textual learner input.
AI evaluators reach highest agreement with humans precisely on the facial-expression-grounded empathy dimension.
Prompt-level integration of structured facial representations adds empathy with minimal added cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Live webcam input could let the same prompting method adapt tutor tone in real time during actual sessions.
The same structured AU signals might transfer to other tutoring contexts such as language practice or skill drills.
Combining AU conditioning with voice or text sentiment cues could produce tutors sensitive to multiple affective channels at once.

Load-bearing premise

The simulated student agent displaying facial behaviors from an unlabeled video dataset accurately represents real learners' affective and cognitive states during tutoring.

What would settle it

If human raters in a study with actual students rate AUM-conditioned tutors no higher on empathy than text-only or random-frame tutors, the central claim would not hold.

Figures

Figures reproduced from arXiv: 2604.15336 by Edmund Bu, Junhua Ma, Laura Fleig, Melinda Ozel, Philip Chi, Ruisen Tu, Shuangquan Feng, Teng Fei, Virginia R. de Sa.

**Figure 1.** Figure 1: Facial-expression-aware tutoring workflow and targeted comparisons. Given the same conversation history and a student’s facial expression video, an AU estimation model (AUM) enables two expression-aware integration strategies: (1) LLM+AUM converts estimated AU intensities into a textual description that is appended to the tutor prompt (AU→Text); (2) MLLM+AUM selects a peak-expression frame as the visual … view at source ↗

read the original abstract

Large language models (LLMs) enable increasingly capable tutoring-style conversational agents, yet effective tutoring requires sensitivity to learners' affective and cognitive states beyond text alone. Facial expressions provide immediate and practical cues of confusion, frustration, or engagement, but remain underexplored in LLM-driven tutoring. We investigate whether facial-expression-aware signals can improve empathetic tutoring responses through prompt-level integration, without end-to-end retraining. We build a scalable simulated tutoring environment where a student agent exhibits diverse facial behaviors from a large unlabeled facial expression video dataset, and compare four tutor variants: a text-only LLM baseline, a multimodal baseline using a random facial frame, and two Action Unit estimation model (AUM)-based methods that either inject textual AU descriptions or select a peak-expression frame for visual grounding. Across 960 multi-turn conversations spanning three tutor backbones (GPT-5.1, Claude Ops 4.5, and Gemini 2.5 Pro), we evaluate targeted pairwise comparisons with five human raters and an exhaustive AI evaluator. AU-based conditioning consistently improves empathetic responsiveness to facial expressions across all tutor backbones, while AUM-guided peak-frame selection outperforms random-frame visual input. Textual AU abstraction and peak-frame visual injection show model-dependent advantages. Control analyses show that this improvement does not come at the expense of worse pedagogical clarity or responsiveness to textual cues. Finally, AI-human agreement is highest on facial-expression-grounded empathy, supporting scalable AI evaluation for this dimension. Overall, our results show that lightweight, structured facial expression representations can meaningfully enhance empathy in LLM-based tutoring systems with minimal overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows prompt-level AU descriptions and peak-frame selection can lift empathy scores in LLM tutors across three backbones without hurting text pedagogy, but the unlabeled video simulation leaves the real-world mapping untested.

read the letter

The core result is that feeding action unit text or selected peak frames into the prompt improves human-rated empathy in multi-turn tutoring dialogues, and this holds for GPT-5.1, Claude Ops 4.5, and Gemini 2.5 Pro. They ran 960 conversations, compared against text-only and random-frame baselines, and added controls showing the gains do not trade off against clarity or responsiveness to the student's words. The AI evaluator also lines up best with humans on the facial-empathy dimension, which is useful for scaling checks later. That combination of scale, multiple models, and the no-tradeoff control is the part worth noting. It gives a low-overhead way to add affective signals without retraining. The simulation step is the main soft spot. The student faces come from an unlabeled video dataset, so there is no direct check that the exhibited action units correspond to confusion or engagement that would actually appear in a tutoring session. Without that validation, the measured improvements could partly reflect dataset patterns rather than genuine sensitivity to learner state. The abstract also leaves out the exact statistical tests, inter-rater numbers, and data filtering rules, which makes the strength of the evidence harder to judge from the summary alone. This work is aimed at HCI and AI-ed researchers who already run prompt experiments and want to add cheap affective conditioning. The empirical comparisons are concrete enough to merit referee time, even if the simulation validity needs more scrutiny in revision. I would send it for review.

Referee Report

2 major / 2 minor

Summary. The paper claims that prompt-level integration of facial expression signals, via Action Unit (AU) textual descriptions or AUM-guided peak-frame selection, improves empathetic responsiveness in LLM tutoring agents across three backbones (GPT-5.1, Claude Ops 4.5, Gemini 2.5 Pro). It introduces a simulated multi-turn tutoring environment where a student agent exhibits behaviors drawn from an unlabeled facial video dataset, evaluates four variants (text-only baseline, random-frame multimodal, textual AU, peak-frame visual) on 960 conversations using human raters and an AI evaluator, and reports that AU conditioning yields gains in facial-expression-grounded empathy without degrading pedagogical clarity or textual responsiveness.

Significance. If the results hold, the work demonstrates a lightweight, training-free method for injecting structured affective cues into existing LLM tutors, with the multi-backbone design and controls for non-affective dimensions providing useful evidence of robustness. The finding that AUM peak-frame selection outperforms random frames, alongside high AI-human agreement on empathy ratings, could support scalable evaluation practices in educational HCI.

major comments (2)

[Simulated Tutoring Environment] Simulated Tutoring Environment section: the central claim that AU-based conditioning improves empathetic responsiveness rests on the assumption that facial behaviors sampled from an unlabeled video dataset validly represent real learners' affective-cognitive states (e.g., confusion or frustration during tutoring). No human validation or explicit mapping from exhibited Action Units to tutoring-relevant labels is reported, leaving open the possibility that observed gains reflect dataset artifacts rather than genuine sensitivity.
[Evaluation] Evaluation section and abstract: the reported consistent improvements across backbones lack accompanying details on the precise empathy metrics or rating scales, the statistical tests applied to the pairwise comparisons, inter-rater reliability for the five human raters, or any data exclusion criteria. These omissions are load-bearing for assessing whether the 960-conversation results support the cross-method and cross-backbone claims.

minor comments (2)

[Abstract] Abstract: the phrase 'exhaustive AI evaluator' is unclear; specify the exact prompting strategy, model used, and how it was calibrated against human judgments.
[Methods] Methods: clarify the exact textual format in which AU descriptions are injected into the prompt for the textual-AU variant, including any truncation or summarization steps.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments below, indicating the revisions we plan to make to strengthen the manuscript.

read point-by-point responses

Referee: [Simulated Tutoring Environment] Simulated Tutoring Environment section: the central claim that AU-based conditioning improves empathetic responsiveness rests on the assumption that facial behaviors sampled from an unlabeled video dataset validly represent real learners' affective-cognitive states (e.g., confusion or frustration during tutoring). No human validation or explicit mapping from exhibited Action Units to tutoring-relevant labels is reported, leaving open the possibility that observed gains reflect dataset artifacts rather than genuine sensitivity.

Authors: We acknowledge that our simulated environment relies on facial behaviors from an unlabeled dataset without additional human validation or explicit AU-to-state mapping performed in this study. The dataset consists of real-world facial videos, and behaviors were sampled to reflect common affective expressions documented in affective computing literature. The consistent improvements across three LLM backbones, together with control analyses showing preserved pedagogical clarity and textual responsiveness, provide evidence that gains are not solely dataset artifacts. We will add a dedicated limitations subsection in the Discussion explicitly addressing the simulation assumptions and calling for future human-validated mappings. revision: yes
Referee: [Evaluation] Evaluation section and abstract: the reported consistent improvements across backbones lack accompanying details on the precise empathy metrics or rating scales, the statistical tests applied to the pairwise comparisons, inter-rater reliability for the five human raters, or any data exclusion criteria. These omissions are load-bearing for assessing whether the 960-conversation results support the cross-method and cross-backbone claims.

Authors: We regret that these details were not presented with sufficient explicitness. Empathy was rated on a 5-point Likert scale (1=not at all to 5=highly) for facial-expression-grounded empathy, with parallel scales for pedagogical clarity and textual responsiveness. Pairwise comparisons used paired t-tests with Bonferroni correction. Inter-rater reliability among the five human raters was computed via Fleiss' kappa (κ=0.68 for empathy). Data exclusion applied to conversations with no detectable facial activity (~4% of cases). We will expand the Evaluation section with a dedicated paragraph reporting these elements, including exact statistical values and reliability metrics. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with external validation

full rationale

The paper reports an experimental study that constructs a simulated tutoring environment from an unlabeled facial video dataset, then runs targeted pairwise comparisons of four prompt variants across three LLM backbones, evaluated by human raters and an AI evaluator. No equations, parameter fits, or first-principles derivations appear in the provided text; the central claims rest on measured differences in empathy scores rather than any quantity that reduces to its own inputs by construction. Self-citations, if present, are not load-bearing for the reported improvements, which are anchored in explicit baselines and external human judgments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the work rests on the empirical validity of the simulated environment and AUM model rather than mathematical derivations.

pith-pipeline@v0.9.0 · 5614 in / 982 out tokens · 40068 ms · 2026-05-15T13:35:33.693678+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

[1]

Alayrac, J

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

work page 2022
[2]

X. An, J. Deng, J. Guo, Z. Feng, X. Zhu, J. Yang, and T. Liu. Killing two birds with one stone: Efficient and robust training of face recognition CNNs by partial FC. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4042– 4051, 2022

work page 2022
[3]

Baltru ˇsaitis, C

T. Baltru ˇsaitis, C. Ahuja, and L.-P. Morency. Multimodal machine learning: A survey and taxonomy.IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018

work page 2018
[4]

L. F. Barrett, R. Adolphs, S. Marsella, A. M. Martinez, and S. D. Pollak. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements.Psychological science in the public interest, 20(1):1–68, 2019

work page 2019
[5]

J. Deng, J. Guo, N. Xue, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690– 4699, 2019

work page 2019
[6]

D’Mello and A

S. D’Mello and A. Graesser. Dynamics of affective states during complex learning.Learning and Instruction, 22(2):145–157, 2012

work page 2012
[7]

Ekman and W

P. Ekman and W. V . Friesen. Facial action coding system.Environ- mental Psychology & Nonverbal Behavior, 1978

work page 1978
[8]

J. M. Girard, W.-S. Chu, L. A. Jeni, and J. F. Cohn. Sayette group formation task (GFT) spontaneous facial expression database. In 2017 12th IEEE international conference on automatic face & gesture recognition (FG 2017), pages 581–588. IEEE, 2017

work page 2017
[9]

Koelstra, C

S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras. DEAP: A database for emotion analysis; using physiological signals.IEEE transactions on affective computing, 3(1):18–31, 2011

work page 2011
[10]

Kollias and S

D. Kollias and S. Zafeiriou. Aff-wild2: Extending the aff-wild database for affect recognition.arXiv preprint arXiv:1811.07770, 2018

work page arXiv 2018
[11]

Kuo, S.-H

C.-M. Kuo, S.-H. Lai, and M. Sarkis. A compact deep learning model for robust facial expression recognition. InProceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 2121–2129, 2018

work page 2018
[12]

Li and W

S. Li and W. Deng. Deep facial expression recognition: A survey. IEEE transactions on affective computing, 13(3):1195–1215, 2020

work page 2020
[13]

H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

work page 2023
[14]

MediaPipe: A Framework for Building Perception Pipelines

C. Lugaresi, J. Tang, H. Nash, C. McClanahan, E. Uboweja, M. Hays, F. Zhang, C.-L. Chang, M. G. Yong, J. Lee, et al. Mediapipe: A framework for building perception pipelines.arXiv preprint arXiv:1906.08172, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1906
[15]

Mavadati, P

M. Mavadati, P. Sanger, and M. H. Mahoor. Extended DISFA dataset: Investigating posed and spontaneous facial expressions. Inproceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 1–8, 2016

work page 2016
[16]

S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn. DISFA: a spontaneous facial action intensity database.IEEE Transactions on Affective Computing, 4(2):151–160, 2013

work page 2013
[17]

Mollahosseini, B

A. Mollahosseini, B. Hasani, and M. H. Mahoor. AffectNet: A database for facial expression, valence, and arousal computing in the wild.IEEE Transactions on Affective Computing, 10(1):18–31, 2017

work page 2017
[18]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021

work page 2021
[19]

Sorin, D

V . Sorin, D. Brin, Y . Barash, E. Konen, A. Charney, G. Nadkarni, and E. Klang. Large language models and empathy: Systematic review. Journal of Medical Internet Research, 26:e52597, 2024

work page 2024
[20]

J. Y . Wang, N. Sukiennik, T. Li, W. Su, Q. Hao, J. Xu, Z. Huang, F. Xu, and Y . Li. A survey on human-centric LLMs.arXiv preprint arXiv:2411.14491, 2024

work page arXiv 2024
[21]

P. Wang, L. Li, L. Chen, Z. Cai, D. Zhu, B. Lin, Y . Cao, L. Kong, Q. Liu, T. Liu, et al. Large language models are not fair evaluators. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9440– 9450, 2024

work page 2024
[22]

X. Wang, X. Li, Z. Yin, Y . Wu, and J. Liu. Emotional intelligence of large language models.Journal of Pacific Rim Psychology, 17:18344909231213958, 2023

work page 2023
[23]

Y . Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang, and Q. Liu. Aligning large language models with human: A survey.arXiv preprint arXiv:2307.12966, 2023

work page arXiv 2023
[24]

Whitehill, Z

J. Whitehill, Z. Serpell, Y .-C. Lin, A. Foster, and J. R. Movellan. The faces of engagement: Automatic recognition of student engagement- from facial expressions.IEEE Transactions on Affective Computing, 5(1):86–98, 2014

work page 2014
[25]

Woolf, W

B. Woolf, W. Burleson, I. Arroyo, T. Dragon, D. Cooper, and R. Picard. Affect-aware tutors: recognising and responding to student affect. International Journal of Learning Technology, 4(3-4):129–164, 2009

work page 2009
[26]

Wulf and J

J. Wulf and J. Meierhofer. Utilizing large language mod- els for automating technical customer support.arXiv preprint arXiv:2406.01407, 2024

work page arXiv 2024
[27]

Zafeiriou, D

S. Zafeiriou, D. Kollias, M. A. Nicolaou, A. Papaioannou, G. Zhao, and I. Kotsia. Aff-wild: valence and arousal’in-the-wild’challenge. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 34–41, 2017

work page 2017
[28]

moderately smiles

X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard. BP4D-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database.Image and Vision Computing, 32(10):692–706, 2014. SUPPLEMENTALMATERIAL A. AUTOLANGUAGEMAPPING INLLM+AUM In theLLM+AUMvariant, facial expression videos are translated into natural l...

work page 2014