pith. machine review for the scientific record.

arxiv: 2604.19196 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords: face anti-spoofing · domain generalization · vision foundation models · self-supervised learning · DINOv2 · data augmentation · cross-domain evaluation · computational efficiency

The pith

Self-supervised vision models like DINOv2 deliver a strong, efficient baseline for cross-domain face anti-spoofing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks fifteen pre-trained vision foundation models on face anti-spoofing tasks that must generalize to completely unseen environments and cameras. It finds that self-supervised vision transformers, especially DINOv2 with registers, suppress unhelpful attention patterns and pick up the fine details that distinguish real faces from spoofs. Adding targeted augmentations for spoofing data and a patch-weighted loss turns this into a vision-only system that matches or exceeds prior results on the MICO and limited-source-domain tests while using far less computation than approaches that add language models. Readers should care because the work shows a practical route to reliable anti-spoofing that avoids the cost and latency of multimodal pipelines.
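
To make the pipeline concrete, here is a minimal sketch of DINOv2 with registers used as a frozen feature extractor for live/spoof classification. The hub entry point and the keys returned by forward_features follow the public facebookresearch/dinov2 repository; the linear head, input size, and everything around them are illustrative assumptions, not the paper's exact configuration.

    # Minimal sketch: frozen DINOv2-with-registers backbone + linear live/spoof head.
    # Hub name and output keys follow the public dinov2 repo; the head is assumed.
    import torch
    import torch.nn as nn

    backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14_reg")
    backbone.eval()

    head = nn.Linear(768, 2)  # live vs. spoof; 768 = ViT-B embedding dimension

    @torch.no_grad()
    def extract_tokens(images: torch.Tensor):
        """images: (B, 3, H, W) with H, W multiples of the 14-px patch size."""
        out = backbone.forward_features(images)
        cls = out["x_norm_clstoken"]         # (B, 768) global token
        patches = out["x_norm_patchtokens"]  # (B, N, 768) per-patch tokens
        return cls, patches

    images = torch.randn(2, 3, 224, 224)     # placeholder batch
    cls, patches = extract_tokens(images)
    logits = head(cls)                       # (2, 2) live/spoof scores

In practice the paper fine-tunes rather than freezes the backbone, and the patch tokens feed the augmentations and loss discussed below; this sketch only fixes the interface.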

Core claim

The paper demonstrates that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical fine-grained spoofing cues. When combined with Face Anti-Spoofing Data Augmentation, Patch-wise Data Augmentation, and Attention-weighted Patch Loss, the resulting vision-only baseline achieves state-of-the-art performance in the MICO protocol and outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency.

What carries the argument

The DINOv2 with Registers backbone combined with FAS-Aug, PDA, and APL; together these suppress attention artifacts and extract domain-generalizable spoofing cues.
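
The page does not reproduce the APL formulation itself. One plausible reading, sketched below, reweights per-patch binary cross-entropy by the backbone's CLS-to-patch attention so that patches the model attends to dominate the loss; the weighting scheme, tensor shapes, and names are assumptions, not the authors' definition.

    # One plausible attention-weighted patch loss (illustrative, not the paper's exact APL).
    import torch
    import torch.nn.functional as F

    def attention_weighted_patch_loss(patch_logits, attn, labels):
        """patch_logits: (B, N) live/spoof score per patch
        attn:         (B, N) CLS-to-patch attention, one row per image
        labels:       (B,)   0 = live, 1 = spoof, broadcast to every patch
        """
        weights = attn / attn.sum(dim=1, keepdim=True)       # normalize per image
        patch_labels = labels[:, None].float().expand_as(patch_logits)
        per_patch = F.binary_cross_entropy_with_logits(
            patch_logits, patch_labels, reduction="none")    # (B, N)
        return (weights * per_patch).sum(dim=1).mean()       # attention-weighted mean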

If this is right

  • Vision-only pipelines can replace or reduce reliance on vision-language models for face anti-spoofing without sacrificing generalization.
  • Self-supervised pre-training supplies better fine-grained cues for spoof detection than supervised pre-training under domain shift.
  • The three proposed augmentations and loss term improve cross-domain accuracy while keeping model size and inference cost low.
  • The resulting baseline can serve as the visual backbone for any future multimodal face anti-spoofing system.
  • Deployment in resource-constrained settings becomes feasible because accuracy gains come with lower computational demands.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar vision-only benchmarking could lower the need for large multimodal models in other biometric or anomaly-detection tasks that face domain shift.
  • The efficiency advantage may allow real-time anti-spoofing to run on edge hardware where vision-language models cannot.
  • Repeating the benchmark with newer foundation models released after DINOv2 would test whether the observed ordering of models remains stable.
  • Field trials on live video streams with uncontrolled lighting and new attack types would reveal whether the reported gains survive outside the fixed protocols.

Load-bearing premise

The chosen fifteen models and the MICO plus LSD protocols together cover the full variety of domain shifts and spoofing variations that appear in real deployments.

What would settle it

Evaluate the same baseline on a new dataset that introduces camera types, lighting conditions, or spoofing materials absent from both MICO and LSD protocols; a large drop in cross-domain accuracy would show the claimed generalization does not hold.
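
For context, cross-domain FAS results under leave-one-domain-out protocols are conventionally reported as HTER, the mean of the attack-acceptance and live-rejection rates at a fixed threshold. A minimal scoring sketch, assuming spoof-probability scores and the usual M/I/C/O dataset mapping (both are assumptions about the protocol, not taken from this page):

    # Minimal HTER scoring plus a leave-one-domain-out skeleton (illustrative).
    import numpy as np

    def hter(scores: np.ndarray, labels: np.ndarray, thr: float = 0.5) -> float:
        """Half Total Error Rate at threshold `thr`.
        scores: predicted spoof probability per sample; labels: 0 = live, 1 = spoof."""
        live, spoof = scores[labels == 0], scores[labels == 1]
        far = np.mean(spoof < thr)   # attacks accepted as live
        frr = np.mean(live >= thr)   # live faces rejected as spoof
        return 0.5 * (far + frr)

    # Assumed M/I/C/O mapping for the MICO protocol:
    domains = ["MSU-MFSD", "Idiap Replay-Attack", "CASIA-FASD", "OULU-NPU"]
    for held_out in domains:
        sources = [d for d in domains if d != held_out]
        # train on `sources`, score every sample of `held_out`,
        # then report hter(scores, labels) for that split

A new dataset with unseen cameras, lighting, or spoof media would slot in as an additional held-out domain in the same loop.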

Figures

Figures reproduced from arXiv: 2604.19196 by Koichi Ito, Mika Feng, Pierre Gallin-Martel, Takafumi Aoki.

Figure 1. Overview of our comprehensive benchmarking framework.
Figure 2. Overview of our proposed vision-only baseline.
read the original abstract

Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper benchmarks 15 pre-trained vision foundation models (supervised CNNs, supervised ViTs, self-supervised ViTs) for domain-generalizable face anti-spoofing under the MICO and Limited Source Domains (LSD) cross-domain protocols. It identifies DINOv2 with Registers as particularly effective at suppressing attention artifacts and capturing spoofing cues, then augments it with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA), and Attention-weighted Patch Loss (APL) to claim state-of-the-art results on MICO, outperformance on data-constrained LSD, and superior efficiency compared to vision-language model approaches. The work positions the resulting vision-only pipeline as a definitive baseline for both vision-only and future multimodal FAS systems.

Significance. If the empirical results hold under representative conditions, the paper provides a valuable, computationally efficient vision-only baseline that challenges the necessity of resource-heavy VLMs for FAS. The systematic comparison across model families offers concrete guidance on which pre-trained features best transfer to spoof detection, and the proposed augmentations and loss could serve as reusable components for the community.

major comments (2)
  1. [§4] §4 (Experimental Setup) and §4.2 (Protocols): The central SOTA and 'definitive baseline' claims rest on MICO and LSD being sufficiently challenging and representative of real-world domain shifts. The protocols appear focused on 2D print/replay attacks and limited camera/lighting variations; without additional experiments or explicit justification covering 3D masks, extreme environmental changes, or other attack types, the generalization conclusions do not fully follow even if the reported numbers are accurate.
  2. [§4.3] §4.3 (Quantitative Results): The performance tables report point estimates without error bars, standard deviations over multiple random seeds, or statistical significance tests comparing the proposed pipeline against baselines. This weakens the strength of the SOTA claim on MICO and the outperformance claim on LSD, as small differences could arise from training variance.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly list the exact domain-shift factors (e.g., camera models, lighting conditions, attack media) included in MICO versus LSD to help readers assess coverage.
  2. [Tables/Figures] Figure captions and table footnotes should clarify whether the reported metrics are averaged over multiple runs or single-run results.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental rigor and protocol justification that we address below. We have prepared revisions to strengthen the presentation of our results and claims.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup) and §4.2 (Protocols): The central SOTA and 'definitive baseline' claims rest on MICO and LSD being sufficiently challenging and representative of real-world domain shifts. The protocols appear focused on 2D print/replay attacks and limited camera/lighting variations; without additional experiments or explicit justification covering 3D masks, extreme environmental changes, or other attack types, the generalization conclusions do not fully follow even if the reported numbers are accurate.

    Authors: We agree that explicit justification is needed to support the scope of our claims. The MICO and LSD protocols are standard benchmarks in the FAS literature for assessing domain generalization under 2D print and replay attacks with camera and lighting shifts, as used in multiple prior works. Our paper positions the vision-only baseline specifically within these established protocols rather than claiming universal generalization. In the revised manuscript, we will expand §4.2 with a dedicated paragraph providing this justification, including references to prior usage of these protocols, and add a limitations subsection noting that 3D mask attacks and extreme conditions fall outside the current evaluation scope. This clarifies the claims without requiring new experiments. revision: partial

  2. Referee: [§4.3] §4.3 (Quantitative Results): The performance tables report point estimates without error bars, standard deviations over multiple random seeds, or statistical significance tests comparing the proposed pipeline against baselines. This weakens the strength of the SOTA claim on MICO and the outperformance claim on LSD, as small differences could arise from training variance.

    Authors: We concur that reporting only point estimates limits the interpretability of the comparisons. To address this, we will rerun the primary experiments across multiple random seeds and update the quantitative tables to include mean values with standard deviations. This revision will provide a stronger statistical foundation for the SOTA and outperformance statements. revision: yes
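
A minimal sketch of the seed aggregation the rebuttal commits to; the per-seed HTER values below are placeholders, not reported numbers:

    # Aggregate per-seed results into mean ± standard deviation (placeholder values).
    import numpy as np

    hter_by_seed = {0: 3.1, 1: 3.4, 2: 2.9, 3: 3.3, 4: 3.0}  # hypothetical HTER (%)

    vals = np.array(list(hter_by_seed.values()))
    print(f"HTER = {vals.mean():.2f} ± {vals.std(ddof=1):.2f} (n={len(vals)})")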

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with no derivations or self-referential reductions

full rationale

The paper performs systematic evaluation of 15 pre-trained vision models on cross-domain FAS protocols (MICO and LSD). It reports direct performance numbers from held-out test sets after applying combinations of existing augmentations (FAS-Aug, PDA) and a loss (APL) to a backbone (DINOv2+Registers). No equations, predictions, or uniqueness theorems are claimed; results are not fitted parameters renamed as outputs, nor do they reduce to self-citations by construction. The central claims rest on empirical measurements rather than any definitional or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

As an empirical benchmarking study, the paper rests its central claim on standard machine-learning assumptions about pre-trained model transfer and the representativeness of the chosen protocols, rather than on new free parameters, axioms, or invented entities.

axioms (2)
  • domain assumption Pre-trained vision models transfer useful features to the FAS task under domain shift
    Invoked when claiming that DINOv2 and similar models capture spoofing cues across unseen domains.
  • domain assumption The MICO and LSD protocols are representative of real-world domain generalization challenges
    Used to support the claim of state-of-the-art performance.

pith-pipeline@v0.9.0 · 5580 in / 1336 out tokens · 34598 ms · 2026-05-10T02:35:17.742159+00:00 · methodology


Reference graph

Works this paper leans on

60 extracted references · 4 canonical work pages · 1 internal anchor

  1. A. Baevski, W. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli. data2vec: A general framework for self-supervised learning in speech, vision and language. Proc. Int’l Conf. Machine Learning, pages 1298–1312, 2022.
  2. H. Bao, L. Dong, S. Piao, and F. Wei. BEiT: BERT pre-training of image transformers. Proc. Int’l Conf. Learning Representations, pages 1–13, 2022.
  3. Z. Boulkenafet, J. Komulainen, and A. Hadid. Face anti-spoofing using speeded-up robust features and Fisher vector encoding. IEEE Signal Processing Letters, 24(2):141–145, 2016.
  4. Z. Boulkenafet, J. Komulainen, L. Li, X. Feng, and A. Hadid. OULU-NPU: A mobile face presentation attack database with real-world variations. IEEE Int’l Conf. Automatic Face Gesture Recog., pages 612–618, 2017.
  5. R. Cai, C. Soh, Z. Yu, H. Li, W. Yang, and A. Kot. Towards data-centric face anti-spoofing: Improving cross-domain generalization via physics-based data synthesis. Int. J. Comput. Vis., pages 1–22, 2024.
  6. M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. Proc. IEEE/CVF Int’l Conf. Computer Vision, pages 9650–9660, 2021.
  7. X. Chen, Y. Jia, and Y. Wu. Fine-grained annotation for face anti-spoofing. CoRR, abs/2310.08142, 2023.
  8. Z. Chen, T. Yao, K. Sheng, S. Ding, Y. Tai, J. Li, F. Huang, and X. Jin. Generalizable representation learning for mixture domain face anti-spoofing. AAAI, pages 1132–1139, 2021.
  9. I. Chingovska, A. Anjos, and S. Marcel. On the effectiveness of local binary patterns in face anti-spoofing. Int. Conf. Biometrics Special Interest Group, pages 1–7, 2012.
  10. D. Wen, H. Han, and A. K. Jain. Face spoof detection with image distortion analysis. IEEE Trans. Inf. Forensics Secur., pages 746–761, 2015.
  11. T. Darcet, M. Oquab, J. Mairal, and P. Bojanowski. Vision transformers need registers. Int. Conf. Learn. Represent., 2024.
  12. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. Int. Conf. Learn. Represent., 2021.
  13. M. Feng, P. A. Gallin-Martel, K. Ito, and T. Aoki. Optimizing DINOv2 with registers for face anti-spoofing. Int. Conf. Comput. Vis. Worksh., pages 3256–3262, 2025.
  14. M. Feng, K. Ito, T. Aoki, T. Ohki, and M. Nishigaki. Leveraging intermediate features of vision transformer for face anti-spoofing. IEEE/CVF Conf. Comput. Vis. Pattern Recog. Worksh., pages 3464–3472, 2025.
  15. X. Ge, X. Liu, Z. Yu, J. Shi, C. Qi, J. Li, and H. Kälviäinen. DiffFAS: Face anti-spoofing via generative diffusion models. Eur. Conf. Comput. Vis., pages 144–161, 2024.
  16. K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 16000–16009, 2022.
  17. X. He, D. Liang, S. Yang, Z. Hao, H. Ma, M. Binjie, X. Li, Y. Wang, P. Yan, and A. Liu. Joint physical-digital facial attack detection via simulating spoofing clues. IEEE/CVF Conf. Comput. Vis. Pattern Recog. Worksh., pages 995–1004, 2024.
  18. Y. Jia, J. Zhang, S. Shan, and X. Chen. Single-side domain generalization for face anti-spoofing. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 8484–8493, 2020.
  19. A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, and R. Girshick. Segment anything. Int. Conf. Comput. Vis., pages 4015–4026, 2023.
  20. J. Komulainen, A. Hadid, and M. Pietikäinen. Context based face anti-spoofing. Int. Conf. Biometrics: Theory, Applications and Systems, pages 1–8, 2013.
  21. B. Le and S. Woo. Gradient alignment for cross-domain face anti-spoofing. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 188–189, 2024.
  22. D. Li, G. Chen, X. Wu, Z. Yu, and M. Tan. Face anti-spoofing with cross-stage relation enhancement and spoof material perception. Neural Networks, 175:106275, 2024.
  23. S. Li and A. Jain. Handbook of Face Recognition. Springer.
  24. Y. Li, G. Yuan, Y. Wen, J. Hu, G. Evangelidis, S. Tulyakov, Y. Wang, and J. Ren. EfficientFormer: Vision transformers at MobileNet speed. Advances in Neural Information Processing Systems, pages 12934–12949, 2022.
  25. C. Liao, W. Chen, H. Liu, Y. Yeh, M. Hu, and C. Chen. Domain invariant vision transformer learning for face anti-spoofing. IEEE/CVF Winter Conf. Applications of Comput. Vis., pages 6087–6096, 2023.
  26. K. Lin, Y. Tseng, K. Huang, J. Wu, and W. Cheng. InstructFLIP: Exploring unified vision-language model for face anti-spoofing. ACM Int. Conf. Multimedia, pages 2987–2996.
  27. A. Liu, S. Xue, J. Gan, J. Wan, Y. Liang, J. Deng, S. Escalera, and Z. Lei. CFPL-FAS: Class free prompt learning for generalizable face anti-spoofing. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 222–232, 2024.
  28. S. Liu, K. Zhang, T. Yao, M. Bi, S. Ding, J. Li, M. Huang, and L. Ma. Adaptive normalized representation learning for generalizable face anti-spoofing. ACM Int. Conf. Multimedia, pages 1469–1477, 2021.
  29. S. Liu, K. Zhang, T. Yao, K. Sheng, S. Ding, Y. Tai, J. Li, Y. Xie, and L. Ma. Dual reweighting domain generalization for face presentation attack detection. IJCAI, pages 867–873, 2021.
  30. Y. Liu, A. Jourabloo, and X. Liu. Learning deep models for face anti-spoofing: Binary or auxiliary supervision. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 389–398, 2018.
  31. Y. Liu, Y. Chen, W. Dai, M. Gou, C. Huang, and H. Xiong. Source-free domain adaptation with contrastive domain alignment and self-supervised exploration for face anti-spoofing. Eur. Conf. Comput. Vis., pages 511–528, 2022.
  32. Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. Proc. IEEE/CVF Int’l Conf. Computer Vision, pages 10012–10022, 2021.
  33. Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, and S. Xie. A ConvNet for the 2020s. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 11976–11986, 2022.
  34. I. Loshchilov and F. Hutter. Decoupled weight decay regularization. Int. Conf. Learn. Represent., 2019.
  35. S. Marcel, J. Fierrez, and N. Evans. Handbook of Biometric Anti-Spoofing. Springer, 2023.
  36. S. Mehta and M. Rastegari. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. Proc. Int’l Conf. Learning Representations, pages 1–13, 2022.
  37. M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. DINOv2: Learning robust visual features without supervision. Trans. Mach. Learn. Res., 2024.
  38. K. Patel, H. Han, and A. K. Jain. Secure face unlock: Spoof detection on smartphones. IEEE Trans. Inf. Forensics Secur., 11(10):2268–2283, 2016.
  39. T. F. Pereira, A. Anjos, J. De Martino, and S. Marcel. Can face anti-spoofing countermeasures work in a real world scenario? Int. Conf. Biometrics, pages 1–8, 2013.
  40. A. Radford, J. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. Proc. Int’l Conf. Machine Learning, pages 8748–8763, 2021.
  41. R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained softmax loss for discriminative face verification. CoRR, abs/1703.09507, 2017.
  42. R. Shao, X. Lan, J. Li, and P. Yuen. Multi-adversarial discriminative deep domain generalization for face presentation attack detection. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 10023–10031, 2019.
  43. R. Shao, X. Lan, and P. C. Yuen. Regularized fine-grained meta face anti-spoofing. AAAI, 34(7):11974–11981, 2020.
  44. O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, and P. Bojanowski. DINOv3. CoRR, abs/2508.10104:1–52, 2025.
  45. K. Srivatsan, M. Naseer, and K. Nandakumar. FLIP: Cross-domain face anti-spoofing with language guidance. Int. Conf. Comput. Vis., pages 19685–19696, 2023.
  46. Y. Sun, Y. Liu, X. Liu, Y. Li, and W. Chu. Rethinking domain generalization for face anti-spoofing: separability and alignment. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 24563–24574, 2023.
  47. H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou. Training data-efficient image transformers & distillation through attention. Proc. Int’l Conf. Machine Learning, pages 10347–10357, 2021.
  48. C. Wang, Y. Lu, S. Yang, and S. Lai. PatchNet: A simple face anti-spoofing framework via fine-grained patch recognition. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 20281–20290, 2022.
  49. G. Wang, H. Han, S. Shan, and X. Chen. Cross-domain face presentation attack detection via multi-domain disentangled representation learning. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 6677–6686, 2020.
  50. Z. Wang, C. Zhao, Y. Qin, Q. Zhou, G. Qi, J. Wan, and Z. Lei. Exploiting temporal and depth information for multi-frame face anti-spoofing. CoRR, abs/1811.05118:1–15, 2018.
  51. Z. Wang, Q. Wang, W. Deng, and G. Guo. Face anti-spoofing using transformers with relation-aware mechanism. IEEE Trans. Biom. Behav. Identity Sci., 4(3):439–450, 2022.
  52. Z. Wang, Z. Wang, Z. Yu, W. Deng, J. Li, T. Gao, and Z. Wang. Domain generalization via shuffled style assembly for face anti-spoofing. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 4123–4133, 2022.
  53. K. Watanabe, K. Ito, and T. Aoki. Spoofing attack detection in face recognition system using vision transformer with patch-wise data augmentation. Asia-Pacific Signal and Information Processing Association Annual Summit and Conf., pages 1561–1565, 2022.
  54. Z. Yu, J. Wan, Y. Qin, X. Li, S. Z. Li, and G. Zhao. NAS-FAS: Static-dynamic central difference network search for face anti-spoofing. IEEE Trans. Pattern Anal. Mach. Intell., 43(9):3005–3023, 2020.
  55. Z. Yu, C. Zhao, Z. Wang, Y. Qin, Z. Su, X. Li, F. Zhou, and G. Zhao. Searching central difference convolutional networks for face anti-spoofing. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 5295–5305, 2020.
  56. D. Zhang, J. Li, and Z. Shan. Implementation of dlib deep learning face recognition technology. Int. Conf. Robots & Intelligent System, pages 88–91, 2020.
  57. G. Zhang, K. Wang, H. Yue, A. Liu, G. Zhang, K. Yao, E. Ding, and J. Wang. Interpretable face anti-spoofing: Enhancing generalization with multimodal large language models. AAAI, pages 9896–9904, 2025.
  58. Z. Zhang, J. Yan, S. Liu, Z. Lei, D. Yi, and S. Z. Li. A face antispoofing database with diverse attacks. Int. Conf. Biometrics, pages 26–31, 2012.
  59. T. Zheng, B. Li, S. Wu, B. Wan, G. Mu, S. Liu, S. Ding, and J. Wang. MFAE: Masked frequency autoencoders for domain generalization face anti-spoofing. IEEE Trans. Inf. Forensics Secur., pages 4058–4069, 2024.
  60. Q. Zhou, K. Zhang, T. Yao, X. Lu, R. Yi, S. Ding, and L. Ma. Instance-aware domain generalization for face anti-spoofing. IEEE/CVF Conf. Comput. Vis. Pattern Recog., pages 20453–20463, 2023.