LaCoVL-FER: Landmark-Guided Contrastive Learning Network with Vision-Language Enhancement for Facial Expression Recognition
Pith reviewed 2026-05-20 06:20 UTC · model grok-4.3
The pith
LaCoVL-FER fuses facial landmark geometry with vision-language priors to improve expression recognition under real-world variations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a Landmark-Guided Adaptive Encoder with Bi-branch Gated Cross Attention can adaptively fuse geometric priors from landmarks and visual features, and that pairing this with a Vision-Language Enhancement Strategy and Expression-Conditioned Prompting aligns instance-aware visual and textual representations from frozen CLIP encoders, yielding more robust and generalizable features for facial expression recognition in uncontrolled settings.
What carries the argument
The Bi-branch Gated Cross Attention mechanism inside the Landmark-Guided Adaptive Encoder, which performs adaptive fusion of landmark-derived geometric features and visual appearance features to emphasize expression-relevant regions.
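To make the fusion idea concrete, here is a minimal pure-Python sketch of a two-branch gated cross-attention step. It is an illustration only: the paper's actual BGCA equations are not reproduced in this summary, so the scalar sigmoid gate, the residual form, and every name used (`gated_branch`, `bi_branch_gated_cross_attention`, `w_v`, `b_v`, `w_l`, `b_l`) are assumptions, and the learned query/key/value projections of a real attention layer are omitted.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cross_attend(queries, context):
    """Scaled dot-product attention: every query token attends over `context`."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in context])
        attended.append([sum(w * c[d] for w, c in zip(weights, context))
                         for d in range(len(q))])
    return attended


def gated_branch(queries, context, gate_w, gate_b):
    """One branch: attend to the other modality, then gate the residual update.

    The gate is a per-token scalar, sigmoid(gate_w . [q ; attended] + gate_b).
    This gate form is a hypothetical choice, not the paper's definition.
    """
    fused, gates = [], []
    for q, a in zip(queries, cross_attend(queries, context)):
        gate = sigmoid(dot(gate_w, q + a) + gate_b)  # q + a concatenates the lists
        gates.append(gate)
        fused.append([qi + gate * ai for qi, ai in zip(q, a)])
    return fused, gates


def bi_branch_gated_cross_attention(visual, landmark, params):
    """Visual tokens query landmark tokens, and landmark tokens query visual tokens."""
    vis_out, vis_gates = gated_branch(visual, landmark, params["w_v"], params["b_v"])
    lmk_out, lmk_gates = gated_branch(landmark, visual, params["w_l"], params["b_l"])
    return vis_out, lmk_out, vis_gates, lmk_gates


# Toy inputs: 6 visual tokens and 3 landmark tokens of dimension 4 (random values).
rng = random.Random(0)
dim = 4
visual = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(6)]
landmark = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3)]
params = {
    "w_v": [rng.gauss(0.0, 0.1) for _ in range(2 * dim)], "b_v": 0.0,
    "w_l": [rng.gauss(0.0, 0.1) for _ in range(2 * dim)], "b_l": 0.0,
}
vis_out, lmk_out, vis_gates, lmk_gates = bi_branch_gated_cross_attention(
    visual, landmark, params)
```

In this sketch, a strongly negative gate bias closes a branch so its tokens pass through almost unchanged, which is one way a learned gate could suppress an unreliable modality.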
If this is right
- The paper reports higher accuracy for the network than for prior state-of-the-art methods on the RAF-DB, FERPlus, and AffectNet datasets.
- Attention focuses more reliably on key facial regions while reducing noise from irrelevant areas.
- Visual and textual representations become better aligned, supporting improved robustness in uncontrolled environments.
- The use of frozen CLIP encoders allows semantic priors to be added without retraining the entire visual backbone.
Where Pith is reading between the lines
- The same landmark-plus-language prior pattern could be tested on video-based expression or action recognition where temporal consistency is needed.
- Because the CLIP components remain frozen, the method could be re-evaluated quickly whenever a newer vision-language foundation model becomes available.
- If the gated fusion proves stable, the approach might reduce the volume of labeled expression data required for training by leveraging geometric and semantic structure.
Load-bearing premise
The Bi-branch Gated Cross Attention and Vision-Language Enhancement Strategy will extract expression-relevant features that hold up under new pose, occlusion, and lighting conditions rather than capturing dataset-specific artifacts.
What would settle it
Running the model on a fourth real-world FER dataset collected with substantially different pose and occlusion distributions and finding no accuracy gain over strong visual-only baselines would indicate the priors do not deliver the claimed generalization.
Original abstract
Facial Expression Recognition (FER) in the wild is still challenging due to uncontrolled variations in pose, occlusion, and illumination. Most existing attention-based methods primarily rely on visual appearance cues, suffering from attention redundancy and instability, which limits their performance in complex scenarios. To address these issues, we propose a novel landmark-guided contrastive learning network with vision-language enhancement for FER (LaCoVL-FER), which integrates geometric priors from facial landmarks and semantic priors from a vision-language model. Specifically, a Landmark-Guided Adaptive Encoder (LGAE) is designed to introduce geometric priors through a Bi-branch Gated Cross Attention (BGCA) mechanism, which achieves adaptive fusion of landmark-based geometric and visual appearance features to produce expression-relevant features, thereby focusing on key facial regions and suppressing noise interference. In parallel, a Vision-Language Enhancement Strategy (VLES) is presented to leverage the expression-relevant features to refine the generalizable visual features extracted by the frozen pretrained CLIP image encoder, yielding expression-specific visual representations. Based on these representations, an Expression-Conditioned Prompting (ECP) mechanism is utilized to further adapt the textual features of fixed class-level prompts from the frozen pretrained CLIP text encoder, generating more instance-aware textual representations. These visual-textual representations are aligned as semantic priors to enhance the robustness and generalization of FER. Quantitative and qualitative experiments demonstrate that our LaCoVL-FER outperforms state-of-the-art methods on three representative real-world FER datasets, including RAF-DB, FERPlus, and AffectNet. The code is available at https://github.com/ylin06804/LaCoVL-FER.
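The visual-textual alignment described in the abstract can be illustrated with a small pure-Python sketch. It assumes the common CLIP-style recipe of cosine similarity divided by a temperature, followed by cross-entropy; the abstract does not state the exact loss. The conditioning rule in `condition_prompts` (an additive offset obtained from a projection `proj` of the visual feature), the values `alpha=0.1` and `temperature=0.07`, and the three-dimensional toy features are all hypothetical stand-ins, not the paper's ECP design.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def condition_prompts(text_feats, visual_feat, proj, alpha=0.1):
    """Shift each fixed class-level text feature by a projection of the image feature.

    Hypothetical rule: t'_c = normalize(t_c + alpha * proj @ v).
    """
    offset = [dot(row, visual_feat) for row in proj]
    return [l2_normalize([t + alpha * o for t, o in zip(feat, offset)])
            for feat in text_feats]


def alignment_logits(visual_feat, text_feats, temperature=0.07):
    """Cosine similarity between the image feature and every class text feature."""
    v = l2_normalize(visual_feat)
    return [dot(v, l2_normalize(t)) / temperature for t in text_feats]


def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


# Toy features standing in for frozen-encoder outputs (made-up values).
class_names = ["happy", "sad", "surprise"]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
visual_feat = [0.9, 0.1, 0.0]
proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity stand-in

instance_text = condition_prompts(text_feats, visual_feat, proj)
logits = alignment_logits(visual_feat, instance_text)
predicted = class_names[logits.index(max(logits))]
loss = cross_entropy(logits, target=0)
```

Because the text features here are adapted per image while the encoders that produced them stay fixed, only the small conditioning map (here `proj`) would need training in a setup of this kind.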
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes LaCoVL-FER, a landmark-guided contrastive learning network with vision-language enhancement for facial expression recognition (FER) in the wild. It introduces a Landmark-Guided Adaptive Encoder (LGAE) incorporating Bi-branch Gated Cross Attention (BGCA) to adaptively fuse landmark-based geometric features with visual appearance features. Additionally, it presents a Vision-Language Enhancement Strategy (VLES) that refines CLIP-extracted visual features using expression-relevant information, and an Expression-Conditioned Prompting (ECP) mechanism to adapt textual features from CLIP. On the basis of quantitative and qualitative experiments, the authors claim that the approach outperforms state-of-the-art methods on the RAF-DB, FERPlus, and AffectNet datasets.
Significance. If the results hold, this work offers a valuable contribution by combining geometric priors from facial landmarks with semantic priors from vision-language models to mitigate attention redundancy and improve generalization in challenging FER scenarios. The use of frozen pretrained CLIP models with targeted adaptation strategies, along with the public availability of the code at the provided GitHub link, enhances the potential for reproducibility and further research in multimodal FER.
major comments (2)
- [§3.2] §3.2 (Bi-branch Gated Cross Attention): The mechanism is presented as achieving adaptive fusion of landmark-based geometric and visual features to focus on key regions and suppress noise. However, the description does not include an explicit component or loss term for down-weighting low-confidence landmarks (common under the pose/occlusion/illumination variations highlighted in the introduction), leaving open the possibility that reported gains on the benchmarks reflect dataset-specific landmark detector performance rather than robust generalization.
- [§4] §4 (Experiments): The outperformance claims on RAF-DB, FERPlus, and AffectNet rest on the assumption that LGAE+BGCA produces expression-relevant features even with imprecise landmarks. No ablation is described that perturbs landmark inputs (e.g., adding Gaussian noise to coordinates or swapping detectors) or reports per-sample landmark confidence statistics correlated with accuracy gains; such analysis is load-bearing for the robustness narrative.
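The landmark-perturbation ablation requested in the second major comment could be set up roughly as follows. This is a sketch under stated assumptions: `predict(image, landmarks)` is a hypothetical callable standing in for a trained model, the toy classifier and the two samples are invented for illustration, and the noise levels are arbitrary pixel values rather than settings from the paper.

```python
import random


def perturb_landmarks(landmarks, sigma, image_size, rng):
    """Add zero-mean Gaussian noise (std `sigma`, in pixels) to each (x, y) point.

    Coordinates are clamped so that they stay inside the image.
    """
    width, height = image_size
    noisy = []
    for x, y in landmarks:
        nx = min(max(x + rng.gauss(0.0, sigma), 0.0), width - 1)
        ny = min(max(y + rng.gauss(0.0, sigma), 0.0), height - 1)
        noisy.append((nx, ny))
    return noisy


def accuracy_under_noise(predict, samples, sigmas, image_size, seed=0):
    """Return {sigma: accuracy} with landmarks perturbed at each noise level.

    `samples` is a list of (image, landmarks, label) triples.
    """
    results = {}
    for sigma in sigmas:
        rng = random.Random(seed)  # same seed for every noise level
        correct = 0
        for image, landmarks, label in samples:
            noisy = perturb_landmarks(landmarks, sigma, image_size, rng)
            correct += int(predict(image, noisy) == label)
        results[sigma] = correct / len(samples)
    return results


def toy_predict(image, landmarks):
    """Hypothetical stand-in model: mouth corners above the mouth centre means 'happy'."""
    (_, left_y), (_, right_y), (_, centre_y) = landmarks
    corners_y = (left_y + right_y) / 2.0
    return "happy" if corners_y < centre_y else "sad"  # smaller y is higher in the image


# Invented samples: (image, [left corner, right corner, mouth centre], label).
samples = [
    (None, [(30.0, 60.0), (70.0, 60.0), (50.0, 66.0)], "happy"),
    (None, [(30.0, 70.0), (70.0, 70.0), (50.0, 64.0)], "sad"),
]
results = accuracy_under_noise(toy_predict, samples, [0.0, 2.0, 8.0], (100, 100))
```

Plotting accuracy against `sigma` for the full model and for a visual-only baseline would show directly how much of the reported gain survives imprecise landmarks.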
minor comments (2)
- [Abstract / §3] The title references 'contrastive learning' while the abstract and method description emphasize visual-textual alignment via VLES and ECP; a short clarification of any additional contrastive loss (e.g., in §3.3) would remove ambiguity.
- [Figures] Figure captions and architecture diagrams should explicitly label the flow from LGAE through BGCA to the CLIP refinement stages to aid readers in tracing the geometric-to-semantic prior integration.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help us strengthen the robustness aspects of our work. We address each major comment below and will incorporate revisions to provide additional analysis on landmark handling.
- Referee: [§3.2] §3.2 (Bi-branch Gated Cross Attention): The mechanism is presented as achieving adaptive fusion of landmark-based geometric and visual features to focus on key regions and suppress noise. However, the description does not include an explicit component or loss term for down-weighting low-confidence landmarks (common under the pose/occlusion/illumination variations highlighted in the introduction), leaving open the possibility that reported gains on the benchmarks reflect dataset-specific landmark detector performance rather than robust generalization.
Authors: We thank the referee for this observation. The Bi-branch Gated Cross Attention (BGCA) uses learnable gating to adaptively modulate the fusion of geometric and visual features based on their cross-modal compatibility, which implicitly down-weights contributions from less reliable landmarks by reducing their influence in the attention computation. This design aims to suppress noise without an explicit confidence term. To directly address the concern, we will revise §3.2 to elaborate on this implicit mechanism and add a supplementary analysis correlating landmark detector confidence scores with per-sample performance gains.
Revision: yes
- Referee: [§4] §4 (Experiments): The outperformance claims on RAF-DB, FERPlus, and AffectNet rest on the assumption that LGAE+BGCA produces expression-relevant features even with imprecise landmarks. No ablation is described that perturbs landmark inputs (e.g., adding Gaussian noise to coordinates or swapping detectors) or reports per-sample landmark confidence statistics correlated with accuracy gains; such analysis is load-bearing for the robustness narrative.
Authors: We agree that targeted robustness ablations are important to substantiate the claims under landmark imprecision. The current experiments demonstrate overall improvements but do not include explicit perturbations or confidence correlations. In the revised manuscript, we will add ablations that introduce controlled Gaussian noise to landmark coordinates and evaluate accuracy changes, along with reporting average landmark confidence statistics and their correlation to accuracy on the test sets of RAF-DB, FERPlus, and AffectNet.
Revision: yes
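The confidence analysis promised in both responses amounts to correlating a per-sample landmark confidence score with a per-sample accuracy gain. The sketch below shows one simple way to compute that, assuming a Pearson correlation and a gain defined as model correctness minus baseline correctness; neither choice is specified in the rebuttal, and the six data points are invented toy values, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def per_sample_gain(baseline_correct, model_correct):
    """+1 where only the model is right, -1 where only the baseline is right, else 0."""
    return [int(m) - int(b) for b, m in zip(baseline_correct, model_correct)]


# Invented toy data: detector confidence and correctness flags for six test images.
confidence = [0.95, 0.90, 0.40, 0.85, 0.30, 0.60]
baseline_correct = [False, False, True, False, True, True]
model_correct = [True, True, True, True, False, True]

gains = per_sample_gain(baseline_correct, model_correct)
r = pearson(confidence, gains)
```

A clearly positive `r` on real data would suggest the gains concentrate on images with reliable landmarks, which is the dependence the referee is worried about; a value near zero would support the robustness claim.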
Circularity Check
No significant circularity; claims rest on independent empirical benchmarks
The paper defines LaCoVL-FER through explicit architectural components (LGAE with BGCA, VLES, and ECP) that are introduced as novel design choices independent of the reported accuracy numbers. Performance is measured on external datasets (RAF-DB, FERPlus, AffectNet) via standard training and evaluation protocols rather than any derivation that reduces to fitted parameters, self-definitions, or self-citation chains. No equations or mechanisms are shown to be equivalent to their inputs by construction, and the central outperformance claim is falsifiable against those benchmarks without internal circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: frozen pretrained CLIP encoders supply generalizable visual and textual features that can be usefully refined for the FER task.
invented entities (3)
- Landmark-Guided Adaptive Encoder (LGAE) with Bi-branch Gated Cross Attention (BGCA): no independent evidence
- Vision-Language Enhancement Strategy (VLES): no independent evidence
- Expression-Conditioned Prompting (ECP): no independent evidence