pith. machine review for the scientific record. sign in

arxiv: 2604.22842 · v1 · submitted 2026-04-21 · 💻 cs.CV · eess.IV

Recognition: unknown

EX-FIQA: Leveraging Intermediate Early eXit Representations from Vision Transformers for Face Image Quality Assessment

Authors on Pith no claims yet

Pith reviewed 2026-05-10 02:58 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords face image quality assessmentvision transformersearly exitscore fusionintermediate representationsbiometricsdeep learninghierarchical features
0
0 comments X

The pith

Fusing quality scores from multiple depths of a vision transformer outperforms using only the final layer for face image quality assessment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision transformers for face image quality assessment hold useful information at intermediate layers, not solely in the deepest block. Systematic checks of all twelve transformer blocks reveal that shallower layers pick up different quality cues that complement deeper ones, visible in their attention patterns and standalone performance. A simple fusion method averages the quality scores from several blocks, with heavier weight on deeper ones, and does so without any architecture changes or retraining. This yields higher accuracy on eight benchmark datasets while also permitting early exits that cut computation when an early layer already gives a reliable score.

Core claim

By examining every transformer block in ViT-FIQA models, the work establishes that intermediate representations contain distinct and complementary quality-relevant signals. A depth-weighted averaging fusion that combines predictions from multiple blocks, assigning higher importance to deeper layers, delivers the strongest assessment performance across datasets and face recognition backbones. Early-exit analysis further shows that competitive results can be reached with far fewer layers, confirming that hierarchical feature learning in transformers can be leveraged directly for quality assessment without extra training.

What carries the argument

Depth-weighted averaging of quality scores from all twelve transformer blocks, combined with early-exit options.

If this is right

  • Quality assessment accuracy rises when predictions from several layers are combined rather than taken from the final block alone.
  • Early exits from intermediate blocks enable substantial compute savings while preserving competitive quality scores.
  • No architectural modifications or additional training are required to realize the gains from multi-depth fusion.
  • The hierarchical structure of vision transformers can be exploited for face analysis tasks by treating intermediate representations as informative rather than discarding them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-depth fusion idea could be tested on other vision tasks where quality or uncertainty varies by layer depth.
  • In mobile or edge biometric deployments, learned or adaptive fusion weights might further improve the performance-efficiency curve beyond the fixed depth-weighted scheme.
  • Attention map differences across layers suggest that quality defects such as blur or occlusion may be more detectable at specific depths, opening a route to defect-specific diagnostics.

Load-bearing premise

Different depths of the transformer capture distinct and complementary information that is relevant to face image quality.

What would settle it

A direct test on any of the eight benchmarks where the fused multi-depth score fails to exceed the single best single-block score would disprove the central performance claim.

Figures

Figures reproduced from arXiv: 2604.22842 by Andrea Atzori, Eduarda Caldeira, Fadi Boutros, Guray Ozgur, Jan Niklas Kolf, Naser Damer, Tahar Chettaoui.

Figure 1
Figure 1. Figure 1: Performance of EX-FIQA exits on FIQA across multiple datasets. The trade-off between computational complexity (GFLOPs) and performance (pAUC-EDC for FNMR@FMR =1e − 4) is shown. Deeper exits generally improve pAUC-EDC (the lower the better), while incurring higher computational cost. recognition. As demonstrated by Biying et al. [21], high perceived quality does not always correlate with FR utility, particu… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of EX-FIQA (C) architecture and proposed early exit mechanism for inference. (1) A face image is divided into equally sized, non-overlapping patches and linearly projected to extract patch embeddings. (2) The patch embeddings are processed through a sequence of 12 transformer blocks. (3) Our early exit mechanism enables inference at any intermediate transformer block by extracting patch representa… view at source ↗
Figure 3
Figure 3. Figure 3: EX-FIQA (T) attention map evolution of the quality token across different samples and exit points. Visualization of attention patterns for four representative face images: a child face, a frontal pose, an occluded face, and a challenging pose of the same person, and the mean of all images from LFW [27] dataset. The individual appearing in the images have provided informed consent for their use in this rese… view at source ↗
Figure 3
Figure 3. Figure 3: Different intermediate layers capture distinct levels [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: EX-FIQA (C) attention map evolution across sample types and exit points. Visualization of attention patterns for four representative face images: a child face, a frontal pose, an occluded face, and a challenging pose. The top row displays original input images, followed by attention maps from Exit 1 through Exit 12. The individual appearing in the images have provided informed consent for their use in this… view at source ↗
Figure 5
Figure 5. Figure 5: EX-FIQA (T) attention map evolution across sample types and exit points. Visualization of attention patterns for four representative face images: a child face, a frontal pose, an occluded face, and a challenging pose. The top row displays original input images, followed by attention maps from Exit 1 through Exit 12. The individual appearing in the images have provided informed consent for their use in this… view at source ↗
Figure 7
Figure 7. Figure 7: EX-FIQA (T) model mean attention maps across datasets and exit points. Rows represent different exit points (1-12), while columns represent different datasets (AgeDB￾30 [43], CALFW [66], CFP-FP [54], CPLFW [65], LFW [27]). AGEDB-30 CFP-FP LFW CALFW CPLFW Exit 1 Exit 2 Exit 3 Exit 4 Exit 5 Exit 6 Exit 7 Exit 8 Exit 9 Exit 10 Exit 11 Exit 12 EX-FIQA (T) - Mean Attention Maps Across Dataset and Exit Points Ta… view at source ↗
Figure 9
Figure 9. Figure 9: Error-versus-Discard Characteristic (EDC) curves for [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Error-versus-Discard Characteristic (EDC) curves for [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: pAUC-EDC and AUC-EDC, FIQA performance metrics, for [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Error-versus-Discard Characteristic (EDC) curves for FNMR@FMR= [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Error-versus-Discard Characteristic (EDC) curves for FNMR@FMR= [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Error-versus-Discard Characteristic (EDC) curves for FNMR@FMR= [PITH_FULL_IMAGE:figures/full_fig_p023_14.png] view at source ↗
read the original abstract

Face Image Quality Assessment is crucial for reliable face recognition systems, yet existing Vision Transformer-based approaches rely exclusively on final-layer representations, ignoring quality-relevant information captured at intermediate network depths. This paper presents the first comprehensive investigation of how intermediate representations within ViTs contribute to face quality assessment through early exit mechanisms and score fusion strategies. We systematically analyze all twelve transformer blocks of ViT-FIQA architectures, demonstrating that different depths capture distinct and complementary quality-relevant information, as evidenced by varying attention patterns and performance characteristics across network layers. We propose a score fusion framework that combines quality predictions from multiple transformer blocks without architectural modifications or additional training. Our early exit analysis reveals optimal performance-efficiency trade-offs, enabling significant computational savings while maintaining competitive performance. Through extensive evaluation across eight benchmark datasets using four FR models, we demonstrate that our fusion strategy improves upon single-exit approaches. Our proposed quality fusion approach employs depth-weighted averaging that assigns progressively higher importance to deeper transformer blocks, achieving the best quality assessment performance by effectively leveraging the hierarchical nature of feature learning in ViTs. Our work challenges the conventional wisdom that only deep features matter for face analysis, revealing that intermediate representations contain valuable information for quality assessment. The proposed framework offers practical benefits for real-world biometric systems by enabling adaptive computation based on resource constraints while maintaining competitive quality assessment capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents EX-FIQA, the first systematic study of intermediate-layer early exits in Vision Transformers for face image quality assessment (FIQA). It examines all twelve transformer blocks of ViT-FIQA models, reports that different depths yield distinct attention patterns and quality-prediction performance, and introduces a training-free score-fusion framework that combines per-block predictions. The central technical contribution is a depth-weighted averaging scheme that assigns progressively higher weights to deeper blocks; the authors claim this exploits the hierarchical nature of ViT feature learning and yields the best FIQA results across eight benchmark datasets and four face-recognition backbones while also enabling early-exit efficiency trade-offs.

Significance. If the reported gains are robust, the work would usefully challenge the default practice of using only final-layer ViT embeddings for FIQA and would supply a practical, zero-training-cost method for trading accuracy against compute in biometric pipelines. The emphasis on complementary information across depths and the provision of early-exit analysis are potentially valuable for resource-constrained deployments.

major comments (2)
  1. [Experiments / fusion results] Experiments / fusion results: The manuscript asserts that depth-weighted averaging (progressively higher weights for deeper blocks) achieves the best performance by leveraging hierarchical feature learning. However, no ablation is presented that directly compares this scheme against uniform averaging of the twelve block scores, against max-pooling, or against a simple learned linear combination. Without such controls, it is impossible to determine whether the observed improvement stems from the specific increasing-depth weighting or merely from fusing multiple layers; this directly affects the load-bearing claim in the abstract and §4.
  2. [§4.2] §4.2 / early-exit analysis: The paper states that different depths capture complementary quality-relevant information, supported by “varying attention patterns and performance characteristics.” The quantitative evidence for complementarity (e.g., pairwise correlation of per-block quality scores or incremental gain when adding each layer) is not reported; only aggregate fusion numbers are given. This weakens the justification for the fusion framework itself.
minor comments (2)
  1. [Method] The exact functional form of the depth-weighted averaging (the sequence of weights w_d for d = 1…12) should be stated explicitly, preferably as an equation, rather than described only qualitatively as “progressively higher importance.”
  2. [Figures / Tables] Table captions and axis labels in the performance plots should indicate the precise fusion variant (depth-weighted vs. other) and the number of blocks used for each early-exit point.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and will revise the manuscript to incorporate the suggested analyses.

read point-by-point responses
  1. Referee: [Experiments / fusion results] Experiments / fusion results: The manuscript asserts that depth-weighted averaging (progressively higher weights for deeper blocks) achieves the best performance by leveraging hierarchical feature learning. However, no ablation is presented that directly compares this scheme against uniform averaging of the twelve block scores, against max-pooling, or against a simple learned linear combination. Without such controls, it is impossible to determine whether the observed improvement stems from the specific increasing-depth weighting or merely from fusing multiple layers; this directly affects the load-bearing claim in the abstract and §4.

    Authors: We agree that the absence of these specific controls limits the strength of the claim that the depth-weighted scheme is superior due to its exploitation of hierarchy rather than fusion in general. While the current manuscript shows that the proposed weighting outperforms single-block baselines and is motivated by observed performance trends across depths, we did not report direct comparisons to uniform averaging, max-pooling, or a learned linear combination. In the revised version we will add these ablations on the same eight datasets and four backbones, reporting the full set of results to isolate the contribution of the increasing-depth weighting. revision: yes

  2. Referee: [§4.2] §4.2 / early-exit analysis: The paper states that different depths capture complementary quality-relevant information, supported by “varying attention patterns and performance characteristics.” The quantitative evidence for complementarity (e.g., pairwise correlation of per-block quality scores or incremental gain when adding each layer) is not reported; only aggregate fusion numbers are given. This weakens the justification for the fusion framework itself.

    Authors: The manuscript supports the complementarity claim via layer-specific attention visualizations and distinct per-block performance numbers, which indicate that intermediate representations are not redundant. However, we did not provide explicit quantitative measures such as pairwise score correlations or incremental gains from cumulative fusion. We will add these analyses to §4.2 in the revision, including correlation matrices across the twelve blocks and step-wise fusion ablations, to give stronger empirical grounding for the fusion framework. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on benchmark evaluations.

full rationale

The paper's core contribution is an empirical analysis of intermediate ViT blocks for face image quality assessment, followed by a proposed score fusion strategy using depth-weighted averaging. No mathematical derivation chain is presented that reduces a claimed result to its own inputs by construction. The depth-weighted scheme is described as a fixed heuristic that assigns higher weights to deeper blocks, with performance gains asserted via evaluation on eight datasets rather than by fitting parameters to the target metric and relabeling them as predictions. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation are referenced in the abstract or described claims. The work is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the depth-weighted scheme is presented as a post-hoc fusion without training, implying the weights are either fixed or chosen outside the reported experiments.

pith-pipeline@v0.9.0 · 5565 in / 1120 out tokens · 84285 ms · 2026-05-10T02:58:11.889775+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

70 extracted references · 5 canonical work pages · 1 internal anchor

  1. [1]

    Atzori, F

    A. Atzori, F. Boutros, and N. Damer. Vit-fiqa: Assessing face image quality using vision transformers. In2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2025

  2. [2]

    Babnik, D

    Z. Babnik, D. K. Jain, P. Peer, and V . Struc. FROQ: observing face recognition models for efficient quality assessment.CoRR, abs/2509.17689, 2025

  3. [3]

    Babnik, P

    Z. Babnik, P. Peer, and V . Struc. Faceqan: Face image quality assessment through adversarial noise exploration. In2022 26th International Conference on Pattern Recognition (ICPR), pages 748– 754, 2022

  4. [4]

    Babnik, P

    ˇZ. Babnik, P. Peer, and V . ˇStruc. Diffiqa: Face image quality assessment using denoising diffusion probabilistic models. In2023 IEEE International Joint Conference on Biometrics (IJCB), pages 1– 10, 2023

  5. [5]

    Babnik, P

    ˇZ. Babnik, P. Peer, and V . ˇStruc. eDifFIQA: Towards Efficient Face Image Quality Assessment based on Denoising Diffusion Probabilistic Models.IEEE Transactions on Biometrics, Behavior, and Identity Science (TBIOM), 2024

  6. [6]

    Bakhtiarnia, Q

    A. Bakhtiarnia, Q. Zhang, and A. Iosifidis. Multi-exit vision trans- former for dynamic inference. InBMVC, page 81. BMV A Press, 2021

  7. [7]

    Bakhtiarnia, Q

    A. Bakhtiarnia, Q. Zhang, and A. Iosifidis. Single-layer vision transformers for more accurate early exits with less overhead.Neural Networks, 153:461–473, 2022

  8. [8]

    Best-Rowden and A

    L. Best-Rowden and A. K. Jain. Learning face image quality from human assessments.IEEE Trans. Inf. Forensics Secur., 13(12):3064– 3077, 2018

  9. [9]

    Bosse, D

    S. Bosse, D. Maniry, K. M ¨uller, T. Wiegand, and W. Samek. Deep neural networks for no-reference and full-reference image quality assessment.IEEE Trans. Image Process., 27(1):206–219, 2018

  10. [10]

    Boutros, N

    F. Boutros, N. Damer, F. Kirchbuchner, and A. Kuijper. Elasticface: Elastic margin loss for deep face recognition. InIEEE/CVF Confer- ence on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2022, New Orleans, LA, USA, June 19-20, 2022, pages 1577–1586. IEEE, 2022

  11. [11]

    Boutros, M

    F. Boutros, M. Fang, M. Klemt, B. Fu, and N. Damer. CR-FIQA: face image quality assessment by learning sample relative classifiability. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 5836–

  12. [12]

    Carion, F

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers. In A. Vedaldi, H. Bischof, T. Brox, and J. Frahm, editors,Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part I, volume 12346 ofLecture Notes in Computer Science, pages 213–229. ...

  13. [13]

    J. Chen, Y . Deng, G. Bai, and G. Su. Face image quality assessment based on learning to rank.IEEE Signal Process. Lett., 22(1):90–94, 2015

  14. [14]

    Chettaoui, N

    T. Chettaoui, N. Damer, and F. Boutros. Froundation: Are foundation models ready for face recognition?Image Vis. Comput., 156:105453, 2025

  15. [15]

    Z. Dai, B. Cai, Y . Lin, and J. Chen. UP-DETR: unsupervised pre- training for object detection with transformers. InIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 1601–1610. Computer Vision Foundation / IEEE, 2021

  16. [16]

    J. Dan, Y . Liu, H. Xie, J. Deng, H. Xie, X. Xie, and B. Sun. Transface: Calibrating transformer training for face recognition from a data- centric perspective. InICCV, pages 20585–20596, 2023

  17. [17]

    J. Deng, J. Guo, N. Xue, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 4690–4699. Computer Vision Foundation / IEEE, 2019

  18. [18]

    Dosovitskiy, L

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview....

  19. [19]

    Eidinger, R

    E. Eidinger, R. Enbar, and T. Hassner. Age and gender estimation of unfiltered faces.IEEE Trans. Inf. Forensics Secur., 9(12):2170–2179, 2014

  20. [20]

    Best practice technical guidelines for automated border control (abc) systems, 2015

    Frontex. Best practice technical guidelines for automated border control (abc) systems, 2015

  21. [21]

    B. Fu, C. Chen, O. Henniger, and N. Damer. A deep insight into measuring face image utility with general and face-specific image quality metrics. InIEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022, pages 1121–1130. IEEE, 2022

  22. [22]

    Grother, M

    P. Grother, M. N. A. Hom, and K. Hanaoka. Ongoing face recognition vendor test (frvt) part 5: Face image quality assessment (4th draft). InNational Institute of Standards and Technology. Tech. Rep., Sep. 2021

  23. [23]

    Grother and E

    P. Grother and E. Tabassi. Performance of biometric quality measures. IEEE Trans. on Pattern Analysis and Machine Intelligence, 29(4):531– 543, Apr. 2007

  24. [24]

    Y . Han, G. Huang, S. Song, L. Yang, H. Wang, and Y . Wang. Dynamic neural networks: A survey.IEEE Trans. Pattern Anal. Mach. Intell., 44(11):7436–7456, 2022

  25. [25]

    Biometric quality: Re- view and application to face recognition with face- qnet

    J. Hernandez-Ortega, J. Galbally, J. Fi ´errez, and L. Beslay. Biometric quality: Review and application to face recognition with faceqnet. CoRR, abs/2006.03298, 2020

  26. [26]

    Hernandez-Ortega, J

    J. Hernandez-Ortega, J. Galbally, J. Fi ´errez, R. Haraksim, and L. Beslay. Faceqnet: Quality assessment for face recognition based on deep learning. In2019 International Conference on Biometrics, ICB 2019, Crete, Greece, June 4-7, 2019, pages 1–8. IEEE, 2019

  27. [27]

    G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. La- beled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007

  28. [28]

    Huang, Y

    Y . Huang, Y . Wang, Y . Tai, X. Liu, P. Shen, S. Li, J. Li, and F. Huang. Curricularface: Adaptive curriculum learning loss for deep face recognition. In2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 5900–5909. Computer Vision Foundation / IEEE, 2020

  29. [29]

    D. A. Hudson and L. Zitnick. Generative adversarial transformers. In M. Meila and T. Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 ofProceedings of Machine Learning Research, pages 4487–4499. PMLR, 2021

  30. [30]

    ISO/IEC 19795-1:2021 Information technology — Biometric performance testing and reporting — Part 1: Principles and framework

    ISO/IEC JTC1 SC37 Biometrics. ISO/IEC 19795-1:2021 Information technology — Biometric performance testing and reporting — Part 1: Principles and framework. International Organization for Standardiza- tion, 2021

  31. [31]

    M. Kim, Y . Su, F. Liu, A. Jain, and X. Liu. Keypoint relative position encoding for face recognition. InCVPR, pages 244–255. IEEE, 2024

  32. [32]

    Knoche, S

    M. Knoche, S. H ¨ormann, and G. Rigoll. Cross-quality LFW: A database for analyzing cross- resolution image face recognition in unconstrained environments. In16th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2021, Jodhpur, India, December 15-18, 2021, pages 1–5. IEEE, 2021

  33. [33]

    J. N. Kolf, N. Damer, and F. Boutros. Grafiqs: Face image quality assessment using gradient magnitudes. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1490–1499, 2024

  34. [34]

    Lecun, L

    Y . Lecun, L. Bottou, Y . Bengio, and P. Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 1998

  35. [35]

    C. Li, Y . Qiang, R. I. Sultan, H. Bagher-Ebadian, P. Khanduri, I. J. Chetty, and D. Zhu. Focalunetr: A focal transformer for boundary- aware prostate segmentation using CT images. In H. Greenspan, A. Madabhushi, P. Mousavi, S. E. Salcudean, J. Duncan, T. F. Syeda- Mahmood, and R. H. Taylor, editors,Medical Image Computing and Computer Assisted Interventi...

  36. [36]

    X. Liu, J. van de Weijer, and A. D. Bagdanov. Rankiqa: Learning from rankings for no-reference image quality assessment. InIEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1040–1049. IEEE Computer Society, 2017

  37. [37]

    Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10- 17, 2021, pages 9992–10002. IEEE, 2021

  38. [38]

    Matsubara, M

    Y . Matsubara, M. Levorato, and F. Restuccia. Split computing and early exiting for deep learning applications: Survey and research challenges.ACM Comput. Surv., 55(5):90:1–90:30, 2023

  39. [39]

    B. Maze, J. C. Adams, J. A. Duncan, N. D. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney, and 9 P. Grother. IARPA janus benchmark - C: face dataset and protocol. In 2018 International Conference on Biometrics, ICB 2018, Gold Coast, Australia, February 20-23, 2018, pages 158–165. IEEE, 2018

  40. [40]

    Q. Meng, S. Zhao, Z. Huang, and F. Zhou. Magface: A universal representation for face recognition and quality assessment. InIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 14225–14234. Computer Vision Foundation / IEEE, 2021

  41. [41]

    Mittal, A

    A. Mittal, A. K. Moorthy, and A. C. Bovik. No-reference image quality assessment in the spatial domain.IEEE Trans. Image Process., 21(12):4695–4708, 2012

  42. [42]

    Mittal, R

    A. Mittal, R. Soundararajan, and A. C. Bovik. Making a ”completely blind” image quality analyzer.IEEE Signal Process. Lett., 20(3):209– 212, 2013

  43. [43]

    Moschoglou, A

    S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou. Agedb: The first manually collected, in-the-wild age database. In2017 IEEE CVPRW, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1997–2005. IEEE Computer Society, 2017

  44. [44]

    Nixon, P

    S. Nixon, P. Ruiu, M. Cadoni, A. Lagorio, and M. Tistarelli. Exploiting face recognizability with early exit vision transformers. In2023 International Conference of the Biometrics Special Interest Group (BIOSIG), pages 1–7, 2023

  45. [45]

    F. Ou, X. Chen, R. Zhang, Y . Huang, S. Li, J. Li, Y . Li, L. Cao, and Y . Wang. SDD-FIQA: unsupervised face image quality assessment with similarity distribution distance. InIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 7670–7679. Computer Vision Foundation / IEEE, 2021

  46. [46]

    F.-Z. Ou, C. Li, S. Wang, and S. Kwong. Clib-fiqa: Face image quality assessment with confidence calibration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1694–1704, 2024

  47. [47]

    F.-Z. Ou, C. Li, S. Wang, and S. Kwong. Mr-fiqa: Face image quality assessment with multi-reference representations from synthetic data generation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12915–12925, October 2025

  48. [48]

    Ozgur, E

    G. Ozgur, E. Caldeira, T. Chettaoui, J. N. Kolf, M. Huber, N. Damer, and F. Boutros. Vitnt-fiqa: Training-free face image quality assessment with vision transformers. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, pages 1203–1213, March 2026

  49. [49]

    Phuong and C

    M. Phuong and C. Lampert. Distillation-based training for multi- exit architectures. In2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 1355–1364. IEEE, 2019

  50. [50]

    L. Qin, M. Wang, C. Deng, K. Wang, X. Chen, J. Hu, and W. Deng. Swinface: A multi-task transformer for face recognition, expression recognition, age estimation and attribute estimation.IEEE Trans. Circuits Syst. Video Technol., 34(4):2223–2234, 2024

  51. [51]

    Raghu, T

    M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovitskiy. Do vision transformers see like convolutional neural networks? In M. Ranzato, A. Beygelzimer, Y . N. Dauphin, P. Liang, and J. W. Vaughan, editors,Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, Decembe...

  52. [52]

    Schlett, C

    T. Schlett, C. Rathgeb, O. Henniger, J. Galbally, J. Fi ´errez, and C. Busch. Face image quality assessment: A literature survey.ACM Comput. Surv., 54(10s):210:1–210:49, 2022

  53. [53]

    Schlett, C

    T. Schlett, C. Rathgeb, J. E. Tapia, and C. Busch. Considerations on the evaluation of biometric quality assessment algorithms.IEEE Trans. Biom. Behav. Identity Sci., 6(1):54–67, 2024

  54. [54]

    Sengupta, J

    S. Sengupta, J. Chen, C. D. Castillo, V . M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In2016 IEEE Winter Conference on Applications of Computer Vision, WACV 2016, Lake Placid, NY, USA, March 7-10, 2016, pages 1–9. IEEE Computer Society, 2016

  55. [55]

    Shi and A

    Y . Shi and A. K. Jain. Probabilistic face embeddings. In2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 6901–

  56. [56]

    Strudel, R

    R. Strudel, R. Garcia, I. Laptev, and C. Schmid. Segmenter: Trans- former for semantic segmentation. In2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 7242–7252. IEEE, 2021

  57. [57]

    Teerapittayanon, B

    S. Teerapittayanon, B. McDanel, and H. T. Kung. Branchynet: Fast inference via early exiting from deep neural networks. In23rd International Conference on Pattern Recognition, ICPR 2016, Canc´un, Mexico, December 4-8, 2016, pages 2464–2469. IEEE, 2016

  58. [58]

    Terh ¨orst, J

    P. Terh ¨orst, J. N. Kolf, N. Damer, F. Kirchbuchner, and A. Kuijper. SER-FIQ: unsupervised estimation of face image quality based on stochastic embedding robustness. In2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 5650–5659. Computer Vision Foundation / IEEE, 2020

  59. [59]

    Touvron, M

    H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. J ´egou. Training data-efficient image transformers & distillation through attention. In M. Meila and T. Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 ofProceedings of Machine Learning Research, page...

  60. [60]

    Vaswani, N

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V . N. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2...

  61. [61]

    Attention Is All You Need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need.CoRR, abs/1706.03762, 2017

  62. [62]

    H. Wang, Y . Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. In2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 5265–5274. IEEE Computer Society, 2018

  63. [63]

    J. Xin, R. Tang, Y . Yu, and J. Lin. Berxit: Early exiting for BERT with better fine-tuning and extension to regression. In P. Merlo, J. Tiedemann, and R. Tsarfaty, editors,Proceedings of the 16th Con- ference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021, pages 91–104. Associ...

  64. [64]

    G. Xu, J. Hao, L. Shen, H. Hu, Y . Luo, H. Lin, and J. Shen. Lgvit: Dynamic early exiting for accelerating vision transformer. In A. El- Saddik, T. Mei, R. Cucchiara, M. Bertini, D. P. T. Vallejo, P. K. Atrey, and M. S. Hossain, editors,Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023- 3 Nove...

  65. [65]

    Zheng and W

    T. Zheng and W. Deng. Cross-pose lfw: A database for studying cross-pose face recognition in unconstrained environments. Technical Report 18-01, Beijing University of Posts and Telecommunications, February 2018

  66. [66]

    Cross-Age LFW: A Database for Studying Cross-Age Face Recognition in Unconstrained Environments

    T. Zheng, W. Deng, and J. Hu. Cross-age LFW: A database for studying cross-age face recognition in unconstrained environments. CoRR, abs/1708.08197, 2017. 10 APPENDIX This supplementary material provides comprehensive ex- perimental results and detailed analysis of theEX-FIQA method for face image quality assessment. The supple- mentary material is struct...

  67. [67]

    [43] [54] [27] [66] [65] [32] [39] 1e−3 1e−4 1e−3 1e−4 1e−3 1e−4 1e−3 1e−4 1e−3 1e−4 1e−3 1e−4 1e−3 1e−4 1e−3 1e−4 ArcFace[17] IQA BRISQUE[41] 0.0565 0.1285 0.0400 0.0585 0.0343 0.0433 0.0043 0.0049 0.0755 0.0813 0.2558 0.3037 0.6680 0.7122 0.0381 0.0656 RankIQA[36] 0.0400 0.0933 0.0372 0.0523 0.0301 0.0384 0.0039 0.0045 0.0846 0.0915 0.2437 0.2969 0.6584...

  68. [68]

    The best exits,EX-FIQAand proposed fusionEX-FIQA-FWare marked with solid lines

    FR models. The best exits,EX-FIQAand proposed fusionEX-FIQA-FWare marked with solid lines. 21 0.00 0.2 0.4 0.6 XQLFW ArcFace ElasticFace MagFace CurricularFace 0.00 0.02 0.04 0.06Adience 0.00 0.02 0.04AgeDB-30 0.00 0.05 0.1 CALFW 0.00 0.02 0.04CFP-FP 0.00 0.1 0.2 CPLFW 0.00 0.00 0.00 0.01 LFW 0 0.4 0.8 0.00 0.01 0.02 0.03IJB-C 0 0.4 0.8 0 0.4 0.8 0 0.4 0....

  69. [69]

    The best exits,EX-FIQAand proposed fusionEX-FIQA-FWare marked with solid lines

    FR models. The best exits,EX-FIQAand proposed fusionEX-FIQA-FWare marked with solid lines. 22 0.00 0.2 0.5 0.8 XQLFW ArcFace ElasticFace MagFace CurricularFace 0.00 0.05 0.1 0.2 Adience 0.00 0.02 0.04 0.06AgeDB-30 0.00 0.05 0.1 CALFW 0.00 0.02 0.04 0.06CFP-FP 0.00 0.1 0.2 0.3 CPLFW 0.00 0.00 0.00 0.01 LFW 0 0.4 0.8 0.00 0.02 0.04IJB-C 0 0.4 0.8 0 0.4 0.8 ...

  70. [70]

    The best exits,EX-FIQAand proposed fusionEX-FIQA-FWare marked with solid lines

    FR models. The best exits,EX-FIQAand proposed fusionEX-FIQA-FWare marked with solid lines. 23