
arxiv: 2605.19821 · v1 · pith:TE5EJY2Inew · submitted 2026-05-19 · 💻 cs.CV

LaCoVL-FER: Landmark-Guided Contrastive Learning Network with Vision-Language Enhancement for Facial Expression Recognition

Pith reviewed 2026-05-20 06:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords facial expression recognition · landmark guidance · vision-language models · contrastive learning · attention mechanisms · real-world FER · CLIP adaptation

The pith

LaCoVL-FER fuses facial landmark geometry with vision-language priors to improve expression recognition under real-world variations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LaCoVL-FER as a way to handle pose, occlusion, and illumination challenges in facial expression recognition by combining geometric information from landmarks with semantic information from a pretrained vision-language model. A Landmark-Guided Adaptive Encoder uses Bi-branch Gated Cross Attention to merge landmark-based geometry with visual appearance, while a Vision-Language Enhancement Strategy refines CLIP visual features and applies Expression-Conditioned Prompting to adapt textual features for instance-specific alignment. This produces expression-relevant representations that the authors show outperform prior methods on three benchmark datasets. A reader would care because purely visual attention approaches often produce redundant or unstable focus, and the dual prior strategy offers a concrete route to more stable performance without requiring fully new training data or architectures.

Core claim

The central claim has two parts. First, a Landmark-Guided Adaptive Encoder with Bi-branch Gated Cross Attention can adaptively fuse geometric priors from landmarks with visual features. Second, pairing this encoder with a Vision-Language Enhancement Strategy and Expression-Conditioned Prompting aligns instance-aware visual and textual representations from frozen CLIP encoders. Together, these are claimed to yield more robust and generalizable features for facial expression recognition in uncontrolled settings.

What carries the argument

The Bi-branch Gated Cross Attention mechanism inside the Landmark-Guided Adaptive Encoder, which performs adaptive fusion of landmark-derived geometric features and visual appearance features to emphasize expression-relevant regions.
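
As a rough illustration of what gated cross attention between two token streams can look like, the sketch below fuses visual tokens with landmark tokens in both directions. It is a minimal sketch under stated assumptions, not the paper's BGCA: the function names, the absence of learned query/key/value projections, the single shared gate vector, and the residual form `own + g * attended` are all illustrative choices that this summary does not confirm.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    """Stable logistic function (avoids overflow for large |x|)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cross_attend(queries, keys_values):
    """Scaled dot-product cross attention.

    Assumption: no learned projections, so keys and values are the same
    vectors. A real module would project queries, keys, and values first.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys_values])
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(d)])
    return out


def gated_fuse(own, attended, gate_w, gate_b):
    """Hypothetical gate: one scalar g in (0, 1) per token.

    g is computed from the concatenation [own; attended], and the fused
    token is own + g * attended.
    """
    fused = []
    for o, a in zip(own, attended):
        g = sigmoid(dot(gate_w, o + a) + gate_b)  # o + a concatenates lists
        fused.append([oi + g * ai for oi, ai in zip(o, a)])
    return fused


def bi_branch_gated_cross_attention(visual, landmark, gate_w, gate_b):
    """Two branches: visual attends to landmark, landmark attends to visual."""
    vis_from_lmk = cross_attend(visual, landmark)
    lmk_from_vis = cross_attend(landmark, visual)
    return (gated_fuse(visual, vis_from_lmk, gate_w, gate_b),
            gated_fuse(landmark, lmk_from_vis, gate_w, gate_b))
```

With `gate_w` set to zeros and `gate_b = 0`, every gate equals 0.5, so each token receives half of its attended counterpart. A strongly negative `gate_b` closes the gate and returns the inputs unchanged, which is the behavior one would want for unreliable landmark tokens.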

If this is right

  • The network reports higher accuracy than prior state-of-the-art methods on the RAF-DB, FERPlus, and AffectNet datasets.
  • Attention focuses more reliably on key facial regions while reducing noise from irrelevant areas.
  • Visual and textual representations become better aligned, supporting improved robustness in uncontrolled environments.
  • The use of frozen CLIP encoders allows semantic priors to be added without retraining the entire visual backbone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same landmark-plus-language prior pattern could be tested on video-based expression or action recognition where temporal consistency is needed.
  • Because the CLIP components remain frozen, the method could be re-evaluated quickly whenever a newer vision-language foundation model becomes available.
  • If the gated fusion proves stable, the approach might reduce the volume of labeled expression data required for training by leveraging geometric and semantic structure.

Load-bearing premise

The Bi-branch Gated Cross Attention and Vision-Language Enhancement Strategy will extract expression-relevant features that hold up under new pose, occlusion, and lighting conditions rather than capturing dataset-specific artifacts.

What would settle it

Running the model on a fourth real-world FER dataset collected with substantially different pose and occlusion distributions and finding no accuracy gain over strong visual-only baselines would indicate the priors do not deliver the claimed generalization.

Figures

Figures reproduced from arXiv: 2605.19821 by Hui Yu, Jiaxin Wang, Junyu Dong, Muwei Jian, Yifan Xia.

Figure 1: Comparison between conventional contrastive learning framework …
Figure 2: Overview of the proposed LaCoVL-FER. To address attention redundancy and instability, it integrates three key components: (1) the Landmark …
Figure 3: The architecture of the BGCA mechanism at the …
Figure 4: Examples of facial images from different datasets.
Figure 5: Confusion matrices of our LaCoVL-FER model, where SU: surprise, FE: fear, DI: disgust, HA: happy, SA: sad, AN: anger, NE: neutral, CO: contempt.
Figure 6: Attention visualization for images from RAF-DB dataset. Column (a)– …
Figure 7: Feature visualization using t-SNE [59] on RAF-DB dataset.

Compared to the fundamental text prototype (e.g., “[class]”), more detailed descriptive prompts (e.g., “a [class] facial expression.”) do not yield consistent performance gains. This phenomenon indicates that in fine-grained FER tasks, overly embellished textual expressions may inadvertently introduce semantic redundancy, thereby attenuating the di…
Original abstract

Facial Expression Recognition (FER) in the wild is still challenging due to uncontrolled variations in pose, occlusion, and illumination. Most existing attention-based methods primarily rely on visual appearance cues, suffering from attention redundancy and instability, which limits their performance in complex scenarios. To address these issues, we propose a novel landmark-guided contrastive learning network with vision-language enhancement for FER (LaCoVL-FER), which integrates geometric priors from facial landmarks and semantic priors from a vision-language model. Specifically, a Landmark-Guided Adaptive Encoder (LGAE) is designed to introduce geometric priors through a Bi-branch Gated Cross Attention (BGCA) mechanism, which achieves adaptive fusion of landmark-based geometric and visual appearance features to produce expression-relevant features, thereby focusing on key facial regions and suppressing noise interference. In parallel, a Vision-Language Enhancement Strategy (VLES) is presented to leverage the expression-relevant features to refine the generalizable visual features extracted by the frozen pretrained CLIP image encoder, yielding expression-specific visual representations. Based on these representations, an Expression-Conditioned Prompting (ECP) mechanism is utilized to further adapt the textual features of fixed class-level prompts from the frozen pretrained CLIP text encoder, generating more instance-aware textual representations. These visual-textual representations are aligned as semantic priors to enhance the robustness and generalization of FER. Quantitative and qualitative experiments demonstrate that our LaCoVL-FER outperforms state-of-the-art methods on three representative real-world FER datasets, including RAF-DB, FERPlus, and AffectNet. The code is available at https://github.com/ylin06804/LaCoVL-FER.
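
To make the visual-textual alignment step more concrete, the sketch below scores one visual feature against a set of class text features with temperature-scaled cosine similarity, in the style of CLIP zero-shot classification. It is an illustration under stated assumptions: `condition_prompts` is a hypothetical stand-in for instance-aware prompt adaptation (a simple additive shift toward the visual feature), and the `alpha` and `temperature` values are placeholders. The abstract does not specify how ECP or the alignment objective are actually formulated.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def condition_prompts(text_feats, visual_feat, alpha=0.1):
    """Hypothetical instance conditioning of class-level text features.

    Each class text feature is shifted toward the visual feature by alpha.
    This is an illustrative assumption, not the paper's ECP mechanism.
    """
    return [[t + alpha * v for t, v in zip(tf, visual_feat)]
            for tf in text_feats]


def classify(visual_feat, text_feats, temperature=0.07):
    """Return class probabilities from cosine similarity / temperature."""
    v = l2_normalize(visual_feat)
    logits = []
    for tf in text_feats:
        t = l2_normalize(tf)
        logits.append(sum(a * b for a, b in zip(v, t)) / temperature)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

A typical call would first adapt the class features with `condition_prompts(text_feats, visual_feat)` and then pass the result to `classify`. In training, the resulting probabilities could feed a cross-entropy loss against the expression label.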

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes LaCoVL-FER, a landmark-guided contrastive learning network with vision-language enhancement for facial expression recognition (FER) in the wild. It introduces a Landmark-Guided Adaptive Encoder (LGAE) incorporating Bi-branch Gated Cross Attention (BGCA) to adaptively fuse landmark-based geometric features with visual appearance features. Additionally, it presents a Vision-Language Enhancement Strategy (VLES) that refines CLIP-extracted visual features using expression-relevant information, and an Expression-Conditioned Prompting (ECP) mechanism to adapt textual features from CLIP. The approach is claimed to outperform state-of-the-art methods on RAF-DB, FERPlus, and AffectNet datasets through quantitative and qualitative experiments.

Significance. If the results hold, this work offers a valuable contribution by combining geometric priors from facial landmarks with semantic priors from vision-language models to mitigate attention redundancy and improve generalization in challenging FER scenarios. The use of frozen pretrained CLIP models with targeted adaptation strategies, along with the public availability of the code at the provided GitHub link, enhances the potential for reproducibility and further research in multimodal FER.

major comments (2)
  1. [§3.2] Bi-branch Gated Cross Attention: The mechanism is presented as achieving adaptive fusion of landmark-based geometric and visual features to focus on key regions and suppress noise. However, the description includes no explicit component or loss term for down-weighting low-confidence landmarks, which are common under the pose, occlusion, and illumination variations highlighted in the introduction. The reported benchmark gains may therefore reflect dataset-specific landmark detector performance rather than robust generalization.
  2. [§4] Experiments: The outperformance claims on RAF-DB, FERPlus, and AffectNet rest on the assumption that LGAE+BGCA produces expression-relevant features even with imprecise landmarks. No ablation is described that perturbs landmark inputs (e.g., adding Gaussian noise to coordinates or swapping detectors) or that reports per-sample landmark confidence statistics correlated with accuracy gains. Such analysis is load-bearing for the robustness narrative.
minor comments (2)
  1. [Abstract / §3] The title references 'contrastive learning' while the abstract and method description emphasize visual-textual alignment via VLES and ECP; a short clarification of any additional contrastive loss (e.g., in §3.3) would remove ambiguity.
  2. [Figures] Figure captions and architecture diagrams should explicitly label the flow from LGAE through BGCA to the CLIP refinement stages to aid readers in tracing the geometric-to-semantic prior integration.
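
The landmark perturbation ablation requested in major comment 2 could be set up along the following lines. This is a minimal sketch, not the authors' protocol: `predict` is a hypothetical callable standing in for a trained model, the sample layout `(image, landmarks, label)` is assumed, and clipping noisy coordinates to the image bounds is one design choice among several.

```python
import random


def perturb_landmarks(landmarks, sigma, width, height, rng):
    """Add zero-mean Gaussian noise (std = sigma pixels) to (x, y) landmarks.

    Coordinates are clipped to stay inside a width x height image.
    """
    noisy = []
    for x, y in landmarks:
        nx = min(max(x + rng.gauss(0.0, sigma), 0.0), width - 1)
        ny = min(max(y + rng.gauss(0.0, sigma), 0.0), height - 1)
        noisy.append((nx, ny))
    return noisy


def accuracy_under_noise(predict, samples, sigmas, width, height, seed=0):
    """Measure accuracy at each noise level.

    predict: hypothetical callable(image, landmarks) -> predicted label.
    samples: list of (image, landmarks, label) tuples (assumed layout).
    Returns a dict mapping each sigma to accuracy in [0, 1].
    """
    rng = random.Random(seed)  # fixed seed keeps the ablation repeatable
    results = {}
    for sigma in sigmas:
        correct = 0
        for image, landmarks, label in samples:
            noisy = perturb_landmarks(landmarks, sigma, width, height, rng)
            correct += int(predict(image, noisy) == label)
        results[sigma] = correct / len(samples)
    return results
```

Plotting accuracy against `sigma` for the full model and for a visual-only baseline would show how quickly the landmark branch degrades, which is the evidence the referee asks for.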

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help us strengthen the robustness aspects of our work. We address each major comment below and will incorporate revisions to provide additional analysis on landmark handling.

Point-by-point responses
  1. Referee: [§3.2] Bi-branch Gated Cross Attention: The description includes no explicit component or loss term for down-weighting low-confidence landmarks, so the reported benchmark gains may reflect dataset-specific landmark detector performance rather than robust generalization.

    Authors: We thank the referee for this observation. The Bi-branch Gated Cross Attention (BGCA) uses learnable gating to adaptively modulate the fusion of geometric and visual features based on their cross-modal compatibility, which implicitly down-weights contributions from less reliable landmarks by reducing their influence in the attention computation. This design aims to suppress noise without an explicit confidence term. To directly address the concern, we will revise §3.2 to elaborate on this implicit mechanism and add a supplementary analysis correlating landmark detector confidence scores with per-sample performance gains. revision: yes

  2. Referee: [§4] Experiments: No ablation perturbs landmark inputs (e.g., adding Gaussian noise to coordinates or swapping detectors) or reports per-sample landmark confidence statistics correlated with accuracy gains. Such analysis is load-bearing for the robustness narrative.

    Authors: We agree that targeted robustness ablations are important to substantiate the claims under landmark imprecision. The current experiments demonstrate overall improvements but do not include explicit perturbations or confidence correlations. In the revised manuscript, we will add ablations that introduce controlled Gaussian noise to landmark coordinates and evaluate accuracy changes, along with reporting average landmark confidence statistics and their correlation to accuracy on the test sets of RAF-DB, FERPlus, and AffectNet. revision: yes
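
The confidence-versus-accuracy analysis mentioned in the responses above could start from a plain Pearson correlation between per-sample landmark confidence and per-sample correctness (coded as 1 for a correct prediction and 0 otherwise). The sketch below is an assumption about how such an analysis might be run; the input lists are hypothetical, and the rebuttal does not state which statistic the authors will report.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Returns NaN when either sequence has zero variance.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-sample values, for illustration only.
confidences = [0.9, 0.8, 0.2, 0.1]   # landmark detector confidence
correct = [1, 1, 0, 0]               # 1 if the expression was classified correctly
r = pearson(confidences, correct)
```

A correlation near zero would support the claim that the gate compensates for unreliable landmarks, whereas a strongly positive value would suggest that accuracy still depends on detector quality.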

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent empirical benchmarks

Full rationale

The paper defines LaCoVL-FER through explicit architectural components (LGAE with BGCA, VLES, and ECP) that are introduced as novel design choices independent of the reported accuracy numbers. Performance is measured on external datasets (RAF-DB, FERPlus, AffectNet) via standard training and evaluation protocols rather than any derivation that reduces to fitted parameters, self-definitions, or self-citation chains. No equations or mechanisms are shown to be equivalent to their inputs by construction, and the central outperformance claim is falsifiable against those benchmarks without internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim depends on the effectiveness of three newly introduced architectural modules whose value is demonstrated only through end-to-end empirical results on the cited datasets.

axioms (1)
  • Domain assumption: Frozen pretrained CLIP encoders supply generalizable visual and textual features that can be usefully refined for the FER task.
    The paper freezes both CLIP image and text encoders and builds refinement modules on top of their outputs.
invented entities (3)
  • Landmark-Guided Adaptive Encoder (LGAE) with Bi-branch Gated Cross Attention (BGCA) · no independent evidence
    purpose: To adaptively fuse landmark-based geometric priors with visual appearance features
    New module introduced to reduce attention redundancy and focus on expression-relevant regions.
  • Vision-Language Enhancement Strategy (VLES) · no independent evidence
    purpose: To refine CLIP visual features using expression-relevant cues
    Proposed to produce more expression-specific visual representations.
  • Expression-Conditioned Prompting (ECP) · no independent evidence
    purpose: To adapt fixed class-level text prompts into instance-aware representations
    New mechanism for tighter visual-textual alignment.



Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 1 internal anchor

  1. [1]

    Silent Messages

    A. Mehrabian, Silent Messages. Belmont, CA, USA: Wadsworth, 1971

  2. [2]

    Sixteen facial expressions occur in similar contexts worldwide,

    A. S. Cowen, D. Keltner, F. Schroff, B. Jou, H. Adam, and G. Prasad, “Sixteen facial expressions occur in similar contexts worldwide,”Nature, vol. 589, no. 7841, pp. 251–257, 2021

  3. [3]

    Personalized machine learning for robot perception of affect and engagement in autism therapy,

    O. Rudovic, J. Lee, M. Dai, B. Schuller, and R. W. Picard, “Personalized machine learning for robot perception of affect and engagement in autism therapy,”Sci. Robot., vol. 3, no. 19, p. eaao6760, 2018

  4. [4]

    Towards facial expression analysis in a driver assistance system,

    T. Wilhelm, “Towards facial expression analysis in a driver assistance system,” inProc. 14th IEEE Int. Conf. Autom. Face & Gesture Recognit. (FG), 2019, pp. 1–4

  5. [5]

    Emotion-aware connected healthcare big data towards 5G,

    M. S. Hossain and G. Muhammad, “Emotion-aware connected healthcare big data towards 5G,” IEEE Internet Things J., vol. 5, no. 4, pp. 2399–2406, 2017

  6. [6]

    Toward label-efficient emotion and sentiment analysis,

    S. Zhao, X. Hong, J. Yang, Y . Zhao, and G. Ding, “Toward label-efficient emotion and sentiment analysis,”Proc. IEEE, vol. 111, no. 10, pp. 1159– 1197, 2023

  7. [7]

    Predicting personalized image emotion perceptions in social networks,

    S. Zhao, H. Yao, Y. Gao, G. Ding, and T.-S. Chua, “Predicting personalized image emotion perceptions in social networks,” IEEE Trans. Affective Comput., vol. 9, no. 4, pp. 526–540, 2016

  8. [8]

    Region attention networks for pose and occlusion robust facial expression recognition,

    K. Wang, X. Peng, J. Yang, D. Meng, and Y . Qiao, “Region attention networks for pose and occlusion robust facial expression recognition,” IEEE Trans. Image Process., vol. 29, pp. 4057–4069, 2020

  9. [9]

    A dual stream attention network for facial expression recognition in the wild,

    H. Tang, Y . Li, and Z. Jin, “A dual stream attention network for facial expression recognition in the wild,”Int. J. Mach. Learn. Cybern., vol. 15, no. 12, pp. 5863–5880, 2024

  10. [10]

    Facial expression recognition with visual transformers and attentional selective fusion,

    F. Ma, B. Sun, and S. Li, “Facial expression recognition with visual transformers and attentional selective fusion,”IEEE Trans. Affective Comput., vol. 14, no. 2, pp. 1236–1248, Jun. 2021

  11. [11]

    Transfer: Learning relation-aware facial expression representations with transformers,

    F. Xue, Q. Wang, and G. Guo, “Transfer: Learning relation-aware facial expression representations with transformers,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 3601–3610

  12. [12]

    QCS: Feature refining from quadruplet cross similarity for facial expression recognition,

    C. Wang, L. Chen, L. Wang, Z. Li, and X. Lv, “QCS: Feature refining from quadruplet cross similarity for facial expression recognition,” in Proc. AAAI Conf. Artif. Intell., vol. 39, no. 7, 2025, pp. 7563–7572

  13. [13]

    POSTER: A pyramid cross-fusion transformer network for facial expression recognition,

    C. Zheng, M. Mendieta, and C. Chen, “POSTER: A pyramid cross-fusion transformer network for facial expression recognition,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023, pp. 3146–3155

  14. [14]

    POSTER++: A simpler and stronger facial expression recognition network,

    J. Mao, R. Xu, X. Yin, Y . Chang, B. Nie, A. Huang, and Y . Wang, “POSTER++: A simpler and stronger facial expression recognition network,”Pattern Recognit., vol. 157, p. 110951, 2025

  15. [15]

    LA-Net: Landmark-aware learning for reliable facial expression recognition under label noise,

    Z. Wu and J. Cui, “LA-Net: Landmark-aware learning for reliable facial expression recognition under label noise,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023, pp. 20698–20707

  16. [16]

    Adaptive multilayer perceptual attention network for facial expression recognition,

    H. Liu, H. Cai, Q. Lin, X. Li, and H. Xiao, “Adaptive multilayer perceptual attention network for facial expression recognition,”IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 9, pp. 6253–6266, 2022

  17. [17]

    Estimation of continuous valence and arousal levels from faces in naturalistic conditions,

    A. Toisoul, J. Kossaifi, A. Bulat, G. Tzimiropoulos, and M. Pantic, “Estimation of continuous valence and arousal levels from faces in naturalistic conditions,”Nat. Mach. Intell., vol. 3, no. 1, pp. 42–50, 2021

  18. [18]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark,et al., “Learning transferable visual models from natural language supervision,” inProc. Int. Conf. Mach. Learn. (ICML), 2021, pp. 8748–8763

  19. [19]

    CLIPER: A unified vision-language framework for in-the-wild facial expression recognition,

    H. Li, H. Niu, Z. Zhu, and F. Zhao, “CLIPER: A unified vision-language framework for in-the-wild facial expression recognition,” inProc. IEEE Int. Conf. Multimedia Expo (ICME), 2024, pp. 1–6

  20. [20]

    CEPrompt: Cross-modal emotion-aware prompting for facial expression recognition,

    H. Zhou, S. Huang, F. Zhang, and C. Xu, “CEPrompt: Cross-modal emotion-aware prompting for facial expression recognition,”IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 11, pp. 11886–11899, 2024

  21. [21]

    VLCA: Vision-language feature enhancement with cross-attention learning for facial expression recognition,

    Z. Zheng, H. Wu, J. Wang, L. Lv, D. Bardou, and G. Yu, “VLCA: Vision-language feature enhancement with cross-attention learning for facial expression recognition,”Expert Syst. Appl., p. 130292, 2025

  22. [22]

    Text prompt region decomposition for effective facial expression recognition,

    W. Nie, H. Zhang, X. Zhang, Z. Wang, and H. Liu, “Text prompt region decomposition for effective facial expression recognition,”IEEE Trans. Affective Comput., 2025

  23. [23]

    Multi-Modal Prompt Learning for Facial Expression Recognition: Leveraging Emojis and Large Language Models,

    E. Pei, H. Zhao, T. Zhang, D. Jiang, L. He, and H. Chen, “Multi-Modal Prompt Learning for Facial Expression Recognition: Leveraging Emojis and Large Language Models,”Inf. Fusion, p. 104063, 2025

  24. [24]

    A compact embedding for facial expression similarity,

    R. Vemulapalli and A. Agarwala, “A compact embedding for facial expression similarity,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 5683–5692

  25. [25]

    FaceNet: A unified embedding for face recognition and clustering,

    F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 815–823

  26. [26]

    Facial expression recognition based on local binary patterns: A comprehensive study,

    C. Shan, S. Gong, and P. W. McOwan, “Facial expression recognition based on local binary patterns: A comprehensive study,”Image Vis. Comput., vol. 27, no. 6, pp. 803–816, 2009

  27. [27]

    Recognizing facial actions using gabor wavelets with neutral face average difference,

    J. J. Bazzo and M. V . Lamar, “Recognizing facial actions using gabor wavelets with neutral face average difference,” inProc. 6th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), 2004, pp. 505–510

  28. [28]

    Island loss for learning discriminative features in facial expression recognition,

    J. Cai, Z. Meng, A. S. Khan, Z. Li, J. O’Reilly, and Y . Tong, “Island loss for learning discriminative features in facial expression recognition,” inProc. 13th IEEE Int. Conf. Autom. Face & Gesture Recognit. (FG), 2018, pp. 302–309

  29. [29]

    Identity–expression dual branch network for facial expression recognition,

    H. Zhang, W. Su, J. Yu, and Z. Wang, “Identity–expression dual branch network for facial expression recognition,”IEEE Trans. Cogn. Dev. Syst., vol. 13, no. 4, pp. 898–911, 2020

  30. [30]

    Low-resolution facial expression recognition: A filter learning perspective,

    Y . Yan, Z. Zhang, S. Chen, and H. Wang, “Low-resolution facial expression recognition: A filter learning perspective,”Signal Process., vol. 169, p. 107370, 2020

  31. [31]

    Suppressing uncertainties for large-scale facial expression recognition,

    K. Wang, X. Peng, J. Yang, S. Lu, and Y. Qiao, “Suppressing uncertainties for large-scale facial expression recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 6897–6906

  32. [32]

    Robust lightweight facial expression recognition network with label distribution training,

    Z. Zhao, Q. Liu, and F. Zhou, “Robust lightweight facial expression recognition network with label distribution training,” inProc. AAAI Conf. Artif. Intell., vol. 35, no. 4, 2021, pp. 3510–3519

  33. [33]

    Multi-relations aware network for in-the-wild facial expression recognition,

    D. Chen, G. Wen, H. Li, R. Chen, and C. Li, “Multi-relations aware network for in-the-wild facial expression recognition,”IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 8, pp. 3848–3859, 2023

  34. [34]

    An image is worth 16×16 words: Transformers for image recognition at scale,

    A. Dosovitskiyet al., “An image is worth 16×16 words: Transformers for image recognition at scale,” inProc. Int. Conf. Learn. Represent. (ICLR), 2021, pp. 1–11

  35. [35]

    Vision transformer with attentive pooling for robust facial expression recognition,

    F. Xue, Q. Wang, Z. Tan, Z. Ma, and G. Guo, “Vision transformer with attentive pooling for robust facial expression recognition,”IEEE Trans. Affective Comput., vol. 14, no. 4, pp. 3244–3256, 2022

  36. [36]

    PyTorch Face Landmark: A Fast and Accurate Facial Landmark Detector

    C. Chen, PyTorch Face Landmark: A Fast and Accurate Facial Landmark Detector, 2021. [Online]. Available: https://github.com/cunjian/pytorchfacelandmark

  37. [37]

    ICoCO: Interpretable concept-guided context optimization for trustworthy facial expression recognition in mental health monitoring,

    L. Zhao, B. Pu, X. Qi, C. Zhu, Q. Lin, C. Wang, and K. Li, “ICoCO: Interpretable concept-guided context optimization for trustworthy facial expression recognition in mental health monitoring,” IEEE Trans. Affective Comput., 2026

  38. [38]

    ArcFace: Additive angular margin loss for deep face recognition,

    J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “ArcFace: Additive angular margin loss for deep face recognition,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 4690–4699

  39. [39]

    Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,

    S. Li, W. Deng, and J. Du, “Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 2852–2861

  40. [40]

    Training deep networks for facial expression recognition with crowd-sourced label distribution,

    E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang, “Training deep networks for facial expression recognition with crowd-sourced label distribution,” inProc. ACM Int. Conf. Multimodal Interact. (ICMI), 2016, pp. 279–283

  41. [41]

    AffectNet: A database for facial expression, valence, and arousal computing in the wild,

    A. Mollahosseini, B. Hasani, and M. H. Mahoor, “AffectNet: A database for facial expression, valence, and arousal computing in the wild,”IEEE Trans. Affective Comput., vol. 10, no. 1, pp. 18–31, Mar. 2017

  42. [42]

    Challenges in representation learning: A report on three machine learning contests,

    I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y . Tang, D. Thaler, D.-H. Lee,et al., “Challenges in representation learning: A report on three machine learning contests,” inProc. Int. Conf. Neural Inf. Process. (ICONIP), 2013, pp. 117–124

  43. [43]

    Imbalanced Dataset Sampler

    Imbalanced Dataset Sampler, 2019. [Online]. Available: https://github.com/ufoym/imbalanced-dataset-sampler

  44. [44]

    Dive into ambiguity: Latent distribution mining and pairwise uncertainty estimation for facial expression recognition,

    J. She, Y. Hu, H. Shi, J. Wang, Q. Shen, and T. Mei, “Dive into ambiguity: Latent distribution mining and pairwise uncertainty estimation for facial expression recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 6248–6257

  45. [45]

    Learn from all: Erasing attention consistency for noisy label facial expression recognition,

    Y . Zhang, C. Wang, X. Ling, and W. Deng, “Learn from all: Erasing attention consistency for noisy label facial expression recognition,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2022, pp. 418–434

  46. [46]

    Teaching with soft label smoothing for mitigating noisy labels in facial expressions,

    T. Lukov, N. Zhao, G. H. Lee, and S.-N. Lim, “Teaching with soft label smoothing for mitigating noisy labels in facial expressions,” inProc. Eur. Conf. Comput. Vis. (ECCV), 2022, pp. 648–665

  47. [47]

    FG-AGR: Fine-grained associative graph representation for facial expression recognition in the wild,

    C. Li, X. Li, X. Wang, D. Huang, Z. Liu, and L. Liao, “FG-AGR: Fine-grained associative graph representation for facial expression recognition in the wild,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 2, pp. 882–896, 2023.

  48. [48]

    Robust facial expression recognition by simultaneously addressing hard and mislabeled samples,

    Y . Min, R. Xu, J. Chen, Y . Ji, and X. Liu, “Robust facial expression recognition by simultaneously addressing hard and mislabeled samples,” Pattern Recognit., vol. 170, p. 112026, 2026

  49. [49]

    AUNet: An action unit–driven local–global interactive attention network with emotion-aware contrastive learning for facial expression recognition,

    D. Guo and F. Xu, “AUNet: An action unit–driven local–global interactive attention network with emotion-aware contrastive learning for facial expression recognition,” Knowl.-Based Syst., p. 115569, 2026

  50. [50]

    PIDViT: Pose-invariant distilled vision transformer for facial expression recognition in the wild,

    Y .-F. Huang and C.-H. Tsai, “PIDViT: Pose-invariant distilled vision transformer for facial expression recognition in the wild,”IEEE Trans. Affective Comput., vol. 14, no. 4, pp. 3281–3293, 2022

  51. [51]

    ExpLLM: Towards Chain of Thought for Facial Expression Recognition,

    X. Lan, J. Xue, J. Qi, D. Jiang, K. Lu, and T.-S. Chua, “ExpLLM: Towards Chain of Thought for Facial Expression Recognition,”IEEE Trans. Multimedia, 2025

  52. [52]

    AMGSN: Adaptive mask-guide supervised network for debiased facial expression recognition,

    T. Gu, H. Li, X. Feng, and Y . Luo, “AMGSN: Adaptive mask-guide supervised network for debiased facial expression recognition,”Pattern Recognit., vol. 170, p. 112023, 2026

  53. [53]

    Learning deep global multi-scale and local attention features for facial expression recognition in the wild,

    Z. Zhao, Q. Liu, and S. Wang, “Learning deep global multi-scale and local attention features for facial expression recognition in the wild,” IEEE Trans. Image Process., vol. 30, pp. 6544–6556, 2021

  54. [54]

    CRS-CONT: A Well-Trained General Encoder for Facial Expression Analysis,

    H. Li, N. Wang, X. Yang, and X. Gao, “CRS-CONT: A Well-Trained General Encoder for Facial Expression Analysis,”IEEE Trans. Image Process., vol. 31, pp. 4637–4650, 2022

  55. [55]

    FERMixNet: An occlusion robust facial expression recognition model with facial mixing augmentation and mid-level representation learning,

    Y . Huang, J. Peng, W. Zhang, T. Zhao, G. Chen, S. Tan, F. Yi, and L. Wang, “FERMixNet: An occlusion robust facial expression recognition model with facial mixing augmentation and mid-level representation learning,”IEEE Trans. Affective Comput., vol. 16, no. 2, pp. 639–654, 2024

  56. [56]

    Adaptively learning facial expression representation via cf labels and distillation,

    H. Li, N. Wang, X. Ding, X. Yang, and X. Gao, “Adaptively learning facial expression representation via cf labels and distillation,”IEEE Trans. Image Process., vol. 30, pp. 2016–2028, 2021

  57. [57]

    MS-Celeb-1M: A dataset and benchmark for large-scale face recognition,

    Y . Guo, L. Zhang, Y . Hu, X. He, and J. Gao, “MS-Celeb-1M: A dataset and benchmark for large-scale face recognition,” inProc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 87–102

  58. [58]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014

  59. [59]

    Visualizing data using t-SNE,

    L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, no. 11, pp. 2579–2605, 2008.

Jiaxin Wang received the B.E. degree in Computer Science and Technology from Yangtze University in 2025. She is currently pursuing the M.E. degree in Control Science and Engineering at Shandong University, Weihai, China. Her current re...