pith. sign in

arxiv: 2606.12036 · v1 · pith:WQLQEAKSnew · submitted 2026-06-10 · 💻 cs.CV

Vision Transformers for Face Recognition Need More Registers

Pith reviewed 2026-06-27 09:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision transformerface recognitionregister tokensattention artifactsconcatenated patch embeddingsinterpretabilityIJB-BIJB-C
0
0 comments X

The pith

Register tokens added to CPE-based Vision Transformers reduce attention artifacts and reach state-of-the-art face recognition accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Concatenated Patch Embeddings improve Vision Transformer performance for face recognition but introduce consistent attention map artifacts that reduce interpretability. Adding learnable register tokens to the patch embeddings mitigates these artifacts across model sizes, with eight registers yielding the clearest maps and best results. The resulting ViT-8R model sets new benchmarks on the large-scale IJB-B and IJB-C datasets. A sympathetic reader would care because the clearer attention structures provide direct insight into what facial regions the model uses for identity decisions.

Core claim

Incorporating register tokens into CPE-based ViT architectures for face recognition mitigates attention artifacts observed in baseline models, with eight registers delivering the highest verification accuracies on IJB-B and IJB-C and substantially clearer attention maps compared to the baseline CPE model.

What carries the argument

Register tokens: learnable tokens concatenated to the initial patch embeddings and processed jointly through the ViT encoder blocks.

If this is right

  • Adding four or eight registers significantly enhances interpretability of attention maps.
  • Eight registers provide the highest verification accuracies and smoothest attention structures.
  • The ViT-8R model achieves state-of-the-art performance among ViT-based FR models on IJB-B and IJB-C.
  • Artifacts appear consistently across small and large ViT backbones and are mitigated by registers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The register token method could be tested on other ViT tasks where attention interpretability matters, such as object detection.
  • The optimal number of registers may depend on input complexity, suggesting experiments with different counts per domain.
  • Clearer attention maps could aid in diagnosing biases within face recognition systems without extra post-processing.

Load-bearing premise

The accuracy gains and artifact reductions are caused by the register tokens rather than by any unstated differences in training procedure, augmentation, or hyperparameters.

What would settle it

Retrain the baseline CPE model using the exact same training procedure, data augmentation, and hyperparameters as the eight-register version, then check whether the attention artifacts remain at similar levels.

Figures

Figures reproduced from arXiv: 2606.12036 by Eduarda Caldeira, Fadi Boutros, Guray Ozgur, Naser Damer, Tahar Chettaoui.

Figure 1
Figure 1. Figure 1: Attention maps of CPE-based ViT-B for FR, [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: CPE-based ViT with Registers pipeline. The image [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: CPE-based attention maps, averaged across five benchmarks (see Section IV), of ViT-L, ViT-B, ViT-S, and ViT-T backbones for FR. All architectures display artifacts [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Attention map of the CLS token in ViT-B averaged across five benchmarks (see Section IV). The CLS-based ViT-B does not suffer from background artifacts. MS-Celeb-1M dataset [19], curated and refined by [16], and comprises 5.8M images of 85K identities. MS1MV3 is a cleaned and updated iteration of MS-Celeb-1M, containing 5.2M images of 96k identities. WebFace4M is a subset of WebFace260M [51], consisting of… view at source ↗
Figure 5
Figure 5. Figure 5: Attention maps of CPE-based ViT-B for FR, [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Attention maps of KP-RPE [29], averaged across five benchmarks (see Section 4), comparing models trained with 0, 1, 2, 4, and 8 registers. Similarly to ViT, we observe that artifacts are reduced once 4 or 8 registers are added. FLOPs and parameter count. The CPE-based methods con￾catenate all patch embeddings z i L ∈ R D for i = 1, . . . , N, which are then projected to a final feature vector. While the re… view at source ↗
read the original abstract

Recent advances in Vision Transformers (ViTs) for face recognition (FR) have moved beyond the standard CLS-token paradigm. In this paradigm, a special classification token (CLS) is prepended to the patch embeddings and used as a representation of the input for downstream tasks. An alternative approach, Concatenated Patch Embeddings (CPE), instead leverages all patch tokens by concatenating them into a single vector, which is then projected into a compact face representation. CPE has been shown to improve recognition performance in comparison to CLS-based ones, but our qualitative analysis of attention maps showed the presence of artifacts that limit their interpretability. To address this issue, we incorporate register tokens, learnable tokens concatenated to the initial patch embeddings, and processed jointly through the ViT encoder blocks. This mechanism has been shown to produce more structured and interpretable attention maps compared to baseline ViT. We empirically demonstrate that these artifacts consistently appear across various ViT backbones, including small and large models, and that introducing register tokens effectively mitigates them. Adding four or eight registers significantly enhances interpretability, with eight registers providing the highest verification accuracies and smoothest attention structures. Our resulting model, ViT-8R, corresponds to a CPE-based ViT-B architecture augmented with eight register tokens achieves state-of-the-art performance among ViT-based FR models on large-scale IJB-B and IJB-C benchmarks. Also, ViT-8R produces substantially clearer attention maps compared with the baseline model, which offer deeper insight into the model's attention behavior (https://github.com/TaharChettaoui/ViT-FR-Registers)

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes adding learnable register tokens to Concatenated Patch Embeddings (CPE) Vision Transformers for face recognition to reduce attention artifacts and boost performance. It reports that 4–8 registers yield clearer attention maps across backbones and that the resulting ViT-8R (CPE ViT-B + 8 registers) achieves state-of-the-art verification accuracy among ViT-based models on the large-scale IJB-B and IJB-C benchmarks, with code released at the provided GitHub repository.

Significance. If the accuracy gains and artifact reduction are causally due to the registers, the work supplies a lightweight architectural change that simultaneously improves accuracy and interpretability for ViT-based face recognition; the public code release is a concrete strength that supports reproducibility.

major comments (2)
  1. [Experimental protocol / results comparison] The central empirical claim (abstract and results) that ViT-8R outperforms the CPE baseline and reaches SOTA rests on the assumption that the two models share identical training details (optimizer, learning-rate schedule, data augmentation, loss, and initialization). No explicit statement, table, or section confirms this matching; without it the performance delta cannot be attributed to the register tokens alone.
  2. [Attention map analysis] The qualitative claim that registers 'effectively mitigate' artifacts 'consistently across various ViT backbones' is load-bearing for the interpretability contribution, yet the manuscript provides no quantitative metric (e.g., attention entropy or artifact count) to support the visual examples; the improvement therefore remains qualitative only.
minor comments (1)
  1. [Abstract] The abstract states 'ViT-8R ... achieves state-of-the-art performance' but does not list the exact competing ViT-based methods or their scores; a compact comparison table would strengthen the SOTA claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [Experimental protocol / results comparison] The central empirical claim (abstract and results) that ViT-8R outperforms the CPE baseline and reaches SOTA rests on the assumption that the two models share identical training details (optimizer, learning-rate schedule, data augmentation, loss, and initialization). No explicit statement, table, or section confirms this matching; without it the performance delta cannot be attributed to the register tokens alone.

    Authors: We agree that an explicit confirmation of identical training protocols is required to attribute performance differences to the register tokens. All models in our experiments were trained with the same optimizer, learning-rate schedule, data augmentations, loss function, and initialization, with the sole difference being the addition of register tokens. In the revised manuscript we will add a dedicated paragraph in the Experimental Setup section that states this matching explicitly, along with a table summarizing the shared hyperparameters. revision: yes

  2. Referee: [Attention map analysis] The qualitative claim that registers 'effectively mitigate' artifacts 'consistently across various ViT backbones' is load-bearing for the interpretability contribution, yet the manuscript provides no quantitative metric (e.g., attention entropy or artifact count) to support the visual examples; the improvement therefore remains qualitative only.

    Authors: We acknowledge that the attention-map analysis is currently supported only by qualitative examples. While the visual results show consistent artifact reduction, we agree a quantitative metric would strengthen the claim. In the revision we will add a quantitative evaluation (e.g., mean attention entropy or a systematic count of artifact regions) computed on held-out images to complement the qualitative figures. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation on public benchmarks

full rationale

The paper reports experimental results from training CPE-based ViT models augmented with register tokens and measuring verification accuracy plus attention-map quality on IJB-B and IJB-C. No derivation, equation, or prediction is presented that reduces by construction to a fitted parameter or self-citation chain defined inside the work. All central claims rest on externally falsifiable benchmark numbers and qualitative visualizations whose validity does not depend on any internal redefinition of the inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The improvement rests on the empirical selection of the number of registers and on the background claim from prior work that register tokens improve attention structure; no new entities are postulated.

free parameters (1)
  • Number of register tokens = 8
    The authors test four and eight registers and select eight because it yields the highest verification accuracies and smoothest attention structures.
axioms (1)
  • domain assumption Register tokens produce more structured and interpretable attention maps in Vision Transformers
    Invoked in the abstract as a known property shown in earlier ViT work and used to motivate the addition of registers to the CPE face-recognition pipeline.

pith-pipeline@v0.9.1-grok · 5838 in / 1398 out tokens · 21798 ms · 2026-06-27T09:45:32.945013+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    Alansari, O

    M. Alansari, O. A. Hay, S. Javed, A. Shoufan, Y . H. Zweiri, and N. Werghi. Ghostfacenets: Lightweight face recognition model from cheap operations.IEEE Access, 11:35429–35446, 2023

  2. [2]

    X. An, X. Zhu, Y . Gao, Y . Xiao, Y . Zhao, Z. Feng, L. Wu, B. Qin, M. Zhang, D. Zhang, and Y . Fu. Partial FC: training 10 million identities on a single machine. InICCVW, pages 1445–1449. IEEE, 2021

  3. [3]

    Boutros, N

    F. Boutros, N. Damer, F. Kirchbuchner, and A. Kuijper. Elasticface: Elastic margin loss for deep face recognition. InCVPR Workshops, pages 1577–1586. IEEE, 2022

  4. [4]

    Boutros, P

    F. Boutros, P. Siebke, M. Klemt, N. Damer, F. Kirchbuchner, and A. Kuijper. Pocketnet: Extreme lightweight face recognition network using neural architecture search and multistep knowledge distillation. IEEE Access, 10:46823–46833, 2022

  5. [5]

    Caron, H

    M. Caron, H. Touvron, I. Misra, H. J ´egou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision trans- formers. InICCV, pages 9630–9640. IEEE, 2021

  6. [6]

    C. R. Chen, Q. Fan, and R. Panda. Crossvit: Cross-attention multi- scale vision transformer for image classification. InICCV, pages 347–

  7. [7]

    S. Chen, Y . Liu, X. Gao, and Z. Han. Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. InCCBR, volume 10996 ofLecture Notes in Computer Science, pages 428–438. Springer, 2018

  8. [8]

    Chettaoui, N

    T. Chettaoui, N. Damer, and F. Boutros. Froundation: Are foundation models ready for face recognition?Image Vis. Comput., 156:105453, 2025

  9. [9]

    E. D. Cubuk, B. Zoph, J. Shlens, and Q. V . Le. Randaugment: Practical automated data augmentation with a reduced search space. In2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA, June 14-19, 2020, pages 3008–3017. Computer Vision Foundation / IEEE, 2020

  10. [10]

    J. Dan, Y . Liu, B. Sun, J. Deng, and S. Luo. Transface++: Rethinking the face recognition paradigm with a focus on accuracy, efficiency, and security.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–18, 2025

  11. [11]

    J. Dan, Y . Liu, H. Xie, J. Deng, H. Xie, X. Xie, and B. Sun. Transface: Calibrating transformer training for face recognition from a data- centric perspective. InICCV, pages 20585–20596. IEEE, 2023

  12. [12]

    Darcet, M

    T. Darcet, M. Oquab, J. Mairal, and P. Bojanowski. Vision transform- ers need registers. InICLR. OpenReview.net, 2024

  13. [13]

    J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. InCVPR, pages 5202–5211. Computer Vision Foundation / IEEE, 2020

  14. [14]

    J. Deng, J. Guo, N. Xue, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InCVPR, pages 4690–4699. Computer Vision Foundation / IEEE, 2019

  15. [15]

    J. Deng, J. Guo, J. Yang, A. Lattas, and S. Zafeiriou. Variational prototype learning for deep face recognition. InCVPR, pages 11906– 11915. Computer Vision Foundation / IEEE, 2021

  16. [16]

    J. Deng, J. Guo, J. Yang, N. Xue, I. Kotsia, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):5962–5979, Oct. 2022

  17. [17]

    J. Deng, J. Guo, D. Zhang, Y . Deng, X. Lu, and S. Shi. Lightweight face recognition challenge. InICCV Workshops, pages 2638–2646. IEEE, 2019

  18. [18]

    Dosovitskiy, L

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Trans- formers for image recognition at scale. InICLR. OpenReview.net, 2021

  19. [19]

    Y . Guo, L. Zhang, Y . Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors,Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, volume 9907 ofLecture Notes in Computer Science, pages 8...

  20. [20]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InCVPR, pages 770–778. IEEE Computer Society, 2016

  21. [21]

    M. I. Hosen and M. B. Islam. Himfr: A hybrid masked face recognition through face inpainting.CoRR, abs/2209.08930, 2022

  22. [22]

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. InICLR. OpenReview.net, 2022

  23. [23]

    G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled Faces in the Wild: A Database forStudying Face Recognition in Unconstrained Environments. InWorkshop on Faces in ’Real-Life’ Images: Detection, Alignment, and Recognition, Marseille, France, Oct. 2008. Erik Learned-Miller and Andras Ferencz and Fr´ed´eric Jurie

  24. [24]

    Huang, Y

    Y . Huang, Y . Wang, Y . Tai, X. Liu, P. Shen, S. Li, J. Li, and F. Huang. Curricularface: Adaptive curriculum learning loss for deep face recognition. InCVPR, pages 5900–5909. Computer Vision Foundation / IEEE, 2020

  25. [25]

    Islam, M

    K. Islam, M. Z. Zaheer, and A. Mahmood. Face pyramid vision transformer. InBMVC, page 758. BMV A Press, 2022

  26. [26]

    Jain and B

    S. Jain and B. C. Wallace. Attention is not explanation. InNAACL- HLT (1), pages 3543–3556. Association for Computational Linguistics, 2019

  27. [27]

    M. Khan, M. Saeed, A. El-Saddik, and W. Gueaieb. Artrivit: Auto- matic face recognition system using vit-based siamese neural networks with a triplet loss. InISIE, pages 1–6. IEEE, 2023

  28. [28]

    M. Kim, A. K. Jain, and X. Liu. Adaface: Quality adaptive margin for face recognition. InCVPR, pages 18729–18738. IEEE, 2022

  29. [29]

    M. Kim, Y . Su, F. Liu, A. Jain, and X. Liu. Keypoint relative position encoding for face recognition. InCVPR, pages 244–255. IEEE, 2024

  30. [30]

    Y . Kim, W. Park, and J. Shin. Broadface: Looking at tens of thousands of people at once for face recognition. InECCV (9), volume 12354 of Lecture Notes in Computer Science, pages 536–552. Springer, 2020

  31. [31]

    Loshchilov and F

    I. Loshchilov and F. Hutter. Decoupled weight decay regularization. InICLR (Poster). OpenReview.net, 2019

  32. [32]

    B. Maze, J. C. Adams, J. A. Duncan, N. D. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney, and P. Grother. IARPA janus benchmark - C: face dataset and protocol. InICB, pages 158–165. IEEE, 2018

  33. [33]

    Q. Meng, S. Zhao, Z. Huang, and F. Zhou. Magface: A universal representation for face recognition and quality assessment. InCVPR, pages 14225–14234. Computer Vision Foundation / IEEE, 2021

  34. [34]

    Mishra and K

    P. Mishra and K. Sarawadekar. Polynomial learning rate policy with warm restart for deep neural network. InTENCON, pages 2087–2092. IEEE, 2019

  35. [35]

    Moschoglou, A

    S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou. Agedb: the first manually collected, in-the-wild age database. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, volume 2, page 5, 2017

  36. [36]

    Oquab, T

    M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. As- sran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. J ´egou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features without supe...

  37. [37]

    Ozgur, E

    G. Ozgur, E. Caldeira, T. Chettaoui, J. N. Kolf, M. Huber, N. Damer, and F. Boutros. Vitnt-fiqa: Training-free face image quality assessment with vision transformers.CoRR, abs/2601.05741, 2026

  38. [38]

    L. Qin, M. Wang, C. Deng, K. Wang, X. Chen, J. Hu, and W. Deng. Swinface: A multi-task transformer for face recognition, expression recognition, age estimation and attribute estimation.IEEE Trans. Circuits Syst. Video Technol., 34(4):2223–2234, 2024

  39. [39]

    Radford, J

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. InICML, volume 139 ofProceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021

  40. [40]

    Rasley, S

    J. Rasley, S. Rajbhandari, O. Ruwase, and Y . He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. InKDD, pages 3505–3506. ACM, 2020

  41. [41]

    Sengupta, J

    S. Sengupta, J. Chen, C. Castillo, V . Patel, R. Chellappa, and D. Ja- cobs. Frontal to profile face verification in the wild. In2016 IEEE Winter Conference on Applications of Computer Vision, WACV 2016, 2016 IEEE Winter Conference on Applications of Computer Vision, W ACV 2016. Institute of Electrical and Electronics Engineers Inc., May 2016. Publisher C...

  42. [42]

    Y . Shi, X. Yu, K. Sohn, M. Chandraker, and A. K. Jain. Towards universal representation learning for deep face recognition. InCVPR, pages 6816–6825. Computer Vision Foundation / IEEE, 2020

  43. [43]

    Sun and G

    Z. Sun and G. Tzimiropoulos. Part-based face recognition with vision transformers. InBMVC, page 611. BMV A Press, 2022

  44. [44]

    Attention Is All You Need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need.CoRR, abs/1706.03762, 2017

  45. [46]

    H. Wang, Y . Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. InCVPR, pages 5265–5274. Computer Vision Foundation / IEEE Computer Society, 2018

  46. [47]

    Whitelam, E

    C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. C. Adams, T. Miller, N. D. Kalka, A. K. Jain, J. A. Duncan, K. Allen, J. Cheney, and P. Grother. IARPA janus benchmark-b face dataset. InCVPR Workshops, pages 592–600. IEEE Computer Society, 2017

  47. [48]

    Zheng and W

    T. Zheng and W. Deng. Cross-pose lfw: A database for studying cross-pose face recognition in unconstrained environments. Technical Report 18-01, Beijing University of Posts and Telecommunications, February 2018

  48. [49]

    Cross-Age LFW: A Database for Studying Cross-Age Face Recognition in Unconstrained Environments

    T. Zheng, W. Deng, and J. Hu. Cross-age LFW: A database for studying cross-age face recognition in unconstrained environments. CoRR, abs/1708.08197, 2017

  49. [50]

    Zhong and W

    Y . Zhong and W. Deng. Face transformer for recognition.CoRR, abs/2103.14803, 2021

  50. [51]

    Z. Zhu, G. Huang, J. Deng, Y . Ye, J. Huang, X. Chen, J. Zhu, T. Yang, J. Lu, D. Du, and J. Zhou. Webface260m: A benchmark unveiling the power of million-scale deep face recognition. InCVPR, pages 10492– 10502. Computer Vision Foundation / IEEE, 2021