VGGT-HPE: Reframing Head Pose Estimation as Relative Pose Prediction
Pith reviewed 2026-05-10 16:00 UTC · model grok-4.3
The pith
Reframing head pose estimation as relative rigid transformation prediction enables state-of-the-art results on real benchmarks using only synthetic training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We argue that predicting the relative rigid transformation between two observed head configurations is a fundamentally easier and more robust formulation than direct absolute regression from a single image. VGGT-HPE, a relative head pose estimator built upon a general-purpose geometry foundation model and finetuned exclusively on synthetic facial renderings, achieves state-of-the-art results on the BIWI benchmark despite zero real-world training data, outperforming established absolute regression methods.
What carries the argument
The argument rests on the relative rigid transformation estimator in VGGT-HPE, which computes a geometric displacement from an explicitly provided anchor image whose pose is known.
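The anchor-based recovery described above can be sketched in a few lines: given a known anchor pose and a predicted relative rigid transform, the absolute target pose follows by composition. The function name and the convention (relative transform maps the anchor configuration onto the target, so the composed transform is relative-after-anchor) are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def absolute_from_relative(R_rel, t_rel, R_anchor, t_anchor):
    """Compose a predicted relative rigid transform with a known anchor pose.

    Convention assumed here: if T_rel maps the anchor head configuration to
    the target configuration, then T_target = T_rel @ T_anchor in homogeneous
    form. Rotations are 3x3 matrices, translations are length-3 vectors.
    """
    R_target = R_rel @ R_anchor
    t_target = R_rel @ t_anchor + t_rel
    return R_target, t_target
```

With an identity relative transform this returns the anchor pose unchanged, which is a quick sanity check on the chosen composition order.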
If this is right
- The advantage of relative over absolute prediction grows with the difficulty of the target pose.
- Training exclusively on synthetic data suffices to outperform real-data-trained absolute methods on real benchmarks.
- Test-time selection of the anchor frame allows control over prediction difficulty, such as using near-neutral poses.
- Controlled easy- and hard-pair benchmarks confirm that relative prediction is intrinsically more accurate.
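The test-time anchor selection mentioned above (e.g., preferring a near-neutral frame) can be sketched as picking the candidate whose rotation is geodesically closest to the identity. The helper names and the use of the identity as the "neutral" reference are assumptions for illustration; the paper does not specify this exact procedure.

```python
import numpy as np

def rotation_angle(R):
    # Geodesic angle (radians) between R and the identity rotation.
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return np.arccos(cos_theta)

def pick_near_neutral_anchor(candidate_rotations):
    # Index of the candidate frame closest to a frontal (identity) pose.
    angles = [rotation_angle(R) for R in candidate_rotations]
    return int(np.argmin(angles))
```

In a video setting the same scoring could instead prefer a temporally adjacent frame, trading a smaller relative displacement for a possibly less accurate anchor pose.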
Where Pith is reading between the lines
- This relative formulation could extend to other monocular pose tasks such as body or hand estimation by reusing the same anchor-based mechanism.
- In video streams, selecting temporally adjacent frames as anchors would likely improve robustness to motion and occlusion.
- Dynamic anchor choice based on estimated pose extremity could create adaptive systems that trade compute for accuracy on demand.
Load-bearing premise
That the relative rigid transformation between two head configurations is easier and more robust to predict than the absolute pose from one image.
What would settle it
Demonstrating that an absolute regression model trained on the same synthetic data achieves higher accuracy than VGGT-HPE on the BIWI hard-pair cases would falsify the claim that relative prediction is intrinsically superior.
Figures
original abstract
Monocular head pose estimation is traditionally formulated as direct regression from a single image to an absolute pose. This paradigm forces the network to implicitly internalize a dataset-specific canonical reference frame. In this work, we argue that predicting the relative rigid transformation between two observed head configurations is a fundamentally easier and more robust formulation. We introduce VGGT-HPE, a relative head pose estimator built upon a general-purpose geometry foundation model. Finetuned exclusively on synthetic facial renderings, our method sidesteps the need for an implicit anchor by reducing the problem to estimating a geometric displacement from an explicitly provided anchor with a known pose. As a practical benefit, the relative formulation also allows the anchor to be chosen at test time - for instance, a near-neutral frame or a temporally adjacent one - so that the prediction difficulty can be controlled by the application. Despite zero real-world training data, VGGT-HPE achieves state-of-the-art results on the BIWI benchmark, outperforming established absolute regression methods trained on mixed and real datasets. Through controlled easy- and hard-pair benchmarks, we also systematically validate our core hypothesis: relative prediction is intrinsically more accurate than absolute regression, with the advantage scaling alongside the difficulty of the target pose. Project page and code: https://vasilikivas.github.io/VGGT-HPE
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reframes monocular head pose estimation as relative rigid transformation prediction between a pair of images rather than absolute regression from one image. VGGT-HPE fine-tunes a general-purpose geometry foundation model exclusively on synthetic facial renderings and claims state-of-the-art results on the BIWI benchmark, outperforming absolute methods trained on real or mixed data. It further reports controlled easy- and hard-pair experiments showing that relative prediction accuracy exceeds absolute regression, with the gap widening for difficult target poses.
Significance. If the results hold under a fair protocol, the work offers a practical alternative to absolute HPE that exploits synthetic data and foundation models, potentially lowering annotation requirements while improving robustness. The controlled pair-wise benchmarks provide direct empirical support for the core hypothesis that relative displacement is intrinsically easier than absolute regression.
major comments (2)
- [§4] §4 (BIWI benchmark protocol): The manuscript must explicitly state how the anchor image is selected at test time and whether its ground-truth pose is provided to the network. If ground-truth anchors are used, the reported SOTA and outperformance over absolute baselines may be attributable to this oracle reference rather than to the relative formulation or successful synthetic-to-real transfer.
- [§4.3] §4.3 (controlled easy/hard-pair benchmarks): The construction of hard pairs (e.g., selection criteria for pose difference thresholds or image pairs) is not described in sufficient detail to allow reproduction or to confirm that the difficulty scaling result is not an artifact of pair selection.
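One plausible construction for the easy/hard split the referee asks about is to threshold the geodesic angle of the relative rotation between the two poses in a pair. The threshold value and labeling scheme below are hypothetical; the manuscript does not state its actual criteria, which is precisely the reproducibility gap flagged here.

```python
import numpy as np

def pair_difficulty(R_a, R_b, hard_threshold_deg=45.0):
    """Label an image pair 'easy' or 'hard' by relative rotation magnitude.

    R_a and R_b are the 3x3 ground-truth head rotations of the two frames.
    The 45-degree threshold is an illustrative choice, not the paper's.
    """
    R_rel = R_b @ R_a.T
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_theta))
    return "hard" if angle_deg > hard_threshold_deg else "easy"
```

Reporting the exact thresholds and the sampling of pairs per bucket would let readers verify that the difficulty-scaling result is not an artifact of pair selection.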
minor comments (2)
- [§1] The abstract and §1 claim 'zero real-world training data' yet the foundation model itself was pretrained on large-scale real data; a brief clarification of this distinction would avoid potential misinterpretation.
- [Tables 1-3] Results tables lack error bars, number of evaluation runs, or statistical significance tests, which would strengthen the SOTA claims.
Simulated Author's Rebuttal
We are grateful to the referee for the detailed and constructive feedback. We address the two major comments below and will incorporate the necessary clarifications into the revised manuscript.
point-by-point responses
-
Referee: [§4] §4 (BIWI benchmark protocol): The manuscript must explicitly state how the anchor image is selected at test time and whether its ground-truth pose is provided to the network. If ground-truth anchors are used, the reported SOTA and outperformance over absolute baselines may be attributable to this oracle reference rather than to the relative formulation or successful synthetic-to-real transfer.
Authors: We agree that the manuscript should explicitly describe the test-time anchor selection and the use of ground-truth poses. In VGGT-HPE, the anchor image is selected at test time (e.g., a near-neutral frame or temporally adjacent image as noted in the abstract), and its ground-truth pose is provided as input to compute the absolute pose of the target via the predicted relative transformation. The network receives only the image pair and outputs the relative pose; it does not receive pose values as input. This is not an 'oracle' in the sense of providing privileged information to the model during prediction—the model must still learn to estimate the geometric displacement accurately from images alone. The known anchor pose is a natural part of the relative formulation, allowing the application to choose an easy reference. For the BIWI results, we will add a precise description of how anchors were chosen in the benchmark protocol. We maintain that the outperformance stems from the relative formulation and synthetic training, as supported by the controlled experiments where relative prediction outperforms absolute regression under matched conditions. revision: yes
-
Referee: [§4.3] §4.3 (controlled easy/hard-pair benchmarks): The construction of hard pairs (e.g., selection criteria for pose difference thresholds or image pairs) is not described in sufficient detail to allow reproduction or to confirm that the difficulty scaling result is not an artifact of pair selection.
Authors: We acknowledge that the details on constructing the easy- and hard-pair benchmarks in §4.3 are insufficient for full reproducibility. We will revise this section to include the exact pose difference thresholds, the criteria for selecting image pairs from the dataset, and any other parameters used to define 'easy' and 'hard' pairs. This will allow readers to reproduce the experiments and verify that the observed accuracy gap scaling with difficulty is not due to biased pair selection. The controlled nature of these benchmarks was intended to isolate the effect of the relative vs. absolute formulation, and we believe the additional details will strengthen this validation. revision: yes
Circularity Check
No circularity; empirical reformulation and benchmark evaluation
full rationale
The paper reframes monocular head pose estimation as relative rigid transformation prediction between image pairs, trains VGGT-HPE exclusively on synthetic renderings, and reports BIWI results via direct empirical comparison to absolute regression baselines. No load-bearing derivation, equation, or 'prediction' reduces by construction to fitted parameters, self-citations, or ansatzes; the core hypothesis is validated through separate easy/hard-pair controlled benchmarks rather than by re-expressing inputs. The evaluation setup (anchor image with known pose) is explicitly described and does not create self-definitional equivalence in the claimed results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Rigid transformations between head configurations can be recovered from image pairs using a geometry foundation model.