pith. machine review for the scientific record.

arxiv: 2605.12498 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.GR

Recognition: no theorem link

EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera

Alain Pagani, Christen Millerdurai, Didier Stricker, Shaoxiang Wang, Vladislav Golyanik, Yaxu Xie

Pith reviewed 2026-05-13 05:19 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords egocentric hand pose · 3D reconstruction · monocular camera · forearm guidance · transformer · absolute pose · AR/VR · depth ambiguity

The pith

EgoForce recovers absolute 3D hand poses from a monocular egocentric camera by guiding the prediction with a differentiable forearm representation and a unified arm-hand transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops EgoForce to reconstruct the absolute 3D pose and shape of hands from the viewpoint of a single head-mounted camera. Previous monocular methods suffer from depth-scale ambiguity and require retraining for each new camera type used in head-mounted devices. EgoForce introduces a forearm representation that stabilizes hand pose prediction and a transformer that jointly models arm and hand geometry from the egocentric image. A ray-space solver then computes absolute 3D positions that remain consistent across fisheye, perspective, and wide-FOV cameras. If this holds, it reduces the cost and effort of creating device-specific datasets for practical AR and VR hand interaction.

Core claim

EgoForce is a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and its position from the user's camera-space viewpoint. It achieves this across fisheye, perspective, and distorted wide-FOV camera models with a single unified network by combining a differentiable forearm representation that stabilizes hand pose, a unified arm-hand transformer that predicts both hand and forearm geometry, and a ray space closed-form solver that enables absolute 3D pose recovery.

What carries the argument

A differentiable forearm representation integrated into a unified arm-hand transformer, together with a ray-space closed-form solver, resolves depth-scale ambiguity and enables absolute 3D recovery.
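The solver is only summarized at this level on this page, so here is a minimal sketch of the kind of closed-form lift a ray-space solver can perform: recover the camera-space root translation that best aligns the network's root-relative 3D joints with the viewing rays of their 2D detections. The least-squares objective, the confidence weighting, and all names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def solve_root_translation(rays, joints_rel, weights=None):
    """Closed-form camera-space translation t minimizing
        sum_i w_i * ||(I - r_i r_i^T)(X_i + t)||^2,
    i.e. the point-to-ray distances of the translated joints.
    Setting the gradient to zero gives A t = b with
        A = sum_i w_i (I - r_i r_i^T),  b = -sum_i w_i (I - r_i r_i^T) X_i.

    rays:       (N, 3) unit viewing rays through the camera origin,
                computed from the 2D keypoints by the known camera model
    joints_rel: (N, 3) root-relative 3D joints predicted by the network
    weights:    (N,) optional per-keypoint confidences
    """
    if weights is None:
        weights = np.ones(len(rays))
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for r, X, w in zip(rays, joints_rel, weights):
        P = np.eye(3) - np.outer(r, r)  # projector orthogonal to the ray
        A += w * P
        b -= w * P @ X
    return np.linalg.solve(A, b)        # degenerate only if all rays are near-parallel
```

Because the rays already encode the camera model, nothing in this solve depends on whether the optics are perspective or fisheye, which is presumably what lets one network plus one solver span devices.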

If this is right

  • State-of-the-art accuracy on egocentric 3D hand pose benchmarks.
  • Up to 28% reduction in camera-space MPJPE on the HOT3D dataset compared to prior methods.
  • Consistent results across fisheye, perspective, and distorted wide-FOV camera configurations.
  • Elimination of the need for costly device-specific training datasets for new head-mounted devices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Consumer AR/VR devices could deploy hand tracking more easily without collecting large custom datasets for each hardware variant.
  • Similar forearm guidance might improve other monocular egocentric estimation tasks, such as full-body or object pose tracking.
  • Integration with existing VR systems could enable more natural hand-centric interactions in telepresence without additional sensors.

Load-bearing premise

That the forearm representation and arm-hand transformer together provide sufficient information to resolve depth-scale ambiguity for accurate absolute 3D hand poses across diverse camera models.
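The depth-scale ambiguity named here is easy to state numerically: under perspective projection, scaling a hand's size and its depth by the same factor leaves every 2D projection unchanged, so image evidence alone cannot fix absolute depth. The toy example below (made-up values; not from the paper) shows the confound and how a metric forearm-length prior, one plausible reading of what the forearm representation contributes, selects the scale.

```python
import numpy as np

f = 300.0                              # focal length in pixels (made-up)
wrist = np.array([0.05, 0.00, 0.40])   # wrist 0.4 m in front of the camera
elbow = np.array([0.08, 0.02, 0.65])   # elbow further along the forearm

def project(X):                        # pinhole projection, principal point at 0
    return f * X[:2] / X[2]

for s in (1.0, 2.0):                   # scale the limb and its depth together
    print(s, project(s * wrist), project(s * elbow))
# both scales print identical 2D points: size and depth are confounded

# a metric prior on forearm length pins the scale (and hence absolute depth)
FOREARM_LEN = 0.26                     # assumed forearm length in metres
s_hat = FOREARM_LEN / np.linalg.norm(elbow - wrist)
print("recovered scale:", round(s_hat, 3))
```

Whether EgoForce exploits the forearm in exactly this way is not stated on this page; the sketch only shows why a forearm-scale constraint is the right shape of information for breaking the ambiguity.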

What would settle it

An independent replication on HOT3D that fails to observe the reported MPJPE reduction, or a test on a previously unseen head-mounted camera model in which performance degrades significantly.
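For reference, camera-space MPJPE is the mean Euclidean distance between predicted and ground-truth joints with no root alignment, so absolute-translation errors count against it; the root-relative variant subtracts the wrist first and hides them. A minimal sketch under assumed array conventions (the benchmarks' official evaluation scripts may differ):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error, in the units of the inputs.
    pred, gt: (frames, joints, 3) camera-space joint positions."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def root_relative_mpjpe(pred, gt, root=0):
    """Same metric after subtracting a root joint (e.g. the wrist),
    which removes absolute-translation error from the comparison."""
    return mpjpe(pred - pred[:, root:root+1], gt - gt[:, root:root+1])
```

Comparing the two variants on the same predictions would also localize where a gain comes from: a large camera-space improvement with little root-relative change points at the absolute-localization stage (the solver) rather than the articulated pose.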

Figures

Figures reproduced from arXiv: 2605.12498 by Alain Pagani, Christen Millerdurai, Didier Stricker, Shaoxiang Wang, Vladislav Golyanik, Yaxu Xie.

Figure 1
Figure 1: EgoForce reconstructs the absolute 3D pose and shape of the hands from the user’s viewpoint using a monocular RGB camera from Aria glasses (top left). With a unified framework, it supports diverse camera models while producing accurate 3D hand pose and shape (bottom), and recovers the absolute 3D hand position in the egocentric frame (top right), enabling metrically meaningful, viewpoint-consistent 3D trac… view at source ↗
Figure 2
Figure 2: EgoForce processes a monocular egocentric RGB frame by extracting hand and forearm crops, tokenizing them, and conditioning the features on crop intrinsics (CIT). A transformer jointly infers hand–arm features to predict 2D keypoints (with confidences) and root-relative 3D hand and arm poses, which are lifted to camera-space meshes via the ray space solver. When the forearm is out of view, arm tokens are r… view at source ↗
Figure 3
Figure 3: The Ray Space Solver is a cross-camera (calibration-conditioned) module that recovers camera-space translation from 2D–3D correspondences, enabling deployment across devices with different optics. view at source ↗
Figure 4
Figure 4: Influence of arm on hand-joint occlusion accuracy (ARCTIC dataset). Adding the arm consistently improves hand pose (RS-MJE), camera-space accuracy (CS-MJE), and temporal stability (RS-ACC, CS-ACC). view at source ↗
Figure 5
Figure 5: Camera-space results on HOT3D. Left: egocentric input with the predicted 2D joint projections. Right: predicted meshes (left red, right blue) and ground-truth meshes (gray) in camera space. view at source ↗
Figure 7
Figure 7: Influence of the variational arm prior. Without the variational prior, the forearm is often mislocalized when it is heavily occluded. With the prior, the model infers a plausible forearm pose; in this example, the forearm is entirely out of view, yet the predicted position and orientation closely match ground truth. view at source ↗
Figure 6
Figure 6: Influence of arm input. Providing the arm crop as an input to the network improves hand pose accuracy. In this example, the right hand is strongly occluded by the phone and the other hand, yet the model recovers a plausible 3D pose, with accurate 2D joint reprojections and a hand-arm mesh closely aligned to ground truth. view at source ↗
Figure 8
Figure 8: Qualitative camera-space results on egocentric datasets. We compare our method against three state-of-the-art camera-space 3D hand pose methods on three datasets with widely different camera intrinsics. Predicted left and right limb meshes are shown in red and blue, respectively, with ground truth highlighted in gray. view at source ↗
Figure 9
Figure 9: Camera-space hand mesh projections on egocentric datasets. We project predicted hand meshes onto images from three camera types: HOT3D (fisheye), H2O/HO3D (perspective), and ARCTIC (distorted perspective). Our method maintains accurate projections under challenging conditions such as motion blur (H2O) and hand-object occlusions (HOT3D, HO3D, ARCTIC). view at source ↗
Figure 10
Figure 10: Camera-space hand-arm mesh projections on egocentric datasets. We project predicted hand and arm meshes onto images from three camera types: HOT3D (fisheye), H2O (perspective), and ARCTIC (distorted perspective). Our method maintains accurate projections under challenging conditions such as motion blur (H2O) and hand-object occlusions (HOT3D, ARCTIC). view at source ↗
Figure 11
Figure 11: Influence of undistortion on input crops. Direct hand-arm crops from the raw fisheye image lead to large errors as fisheye pixels correspond to highly non-linear viewing rays. Rectifying the full frame to a single perspective view reduces distortion but introduces strong peripheral warping and resampling artifacts that amplify localization noise. In contrast, lens-model undistortion preserves the correct … view at source ↗
Figure 12
Figure 12: Influence of Crop Intrinsics Tokens (CIT). CIT encodes crop-specific intrinsics as tokens for the hand-arm crop inputs fed to the transformer. This enables explicit local camera-geometry reasoning and reduces camera-space mesh error, leading to closer alignment with ground truth. view at source ↗
Figure 14
Figure 14: Unified hand–arm mesh. We attach the FARM at the MANO wrist and apply a small elbow-direction offset to avoid overlap and ensure a clean, anatomically consistent connection. view at source ↗
Figure 15
Figure 15: Fusing CIT and Crop Tokens. For each crop (hand/arm), its CIT is broadcast to all patch tokens for that crop and fused in the Combine block via feature concatenation followed by a learnable projection, with a residual addition of the original token embedding. This injects crop-specific geometric context into every patch feature while allowing the model to fall back to the original mapping when the conditi… view at source ↗
Figure 16
Figure 16: Right-hand camera-space trajectory for a HOT3D sequence. Our approach produces a more accurate hand trajectory in camera space, particularly along the depth (z-axis), compared to competing approaches. We visualize 160 frames from the sequence. view at source ↗
Figure 17
Figure 17: Qualitative results on HO3D and in-the-wild data. Our approach produces accurate hand pose estimates even under hand–object occlusions on HO3D (a), and it generalizes to in-the-wild videos despite not being explicitly trained on those data distributions (b). view at source ↗
Figure 18
Figure 18: Robustness to calibration mismatch on HOT3D. As camera-intrinsic perturbation increases, CS-MJE remains stable and even improves slightly under moderate mismatch, despite increasing camera-geometry error; performance degrades clearly only under large mismatches, indicating robustness to moderate calibration error. view at source ↗
Figure 19
Figure 19: Qualitative camera-space results on egocentric datasets. We compare our method against UmeTrack [Han et al. 2022] on three datasets with widely different camera intrinsics. Predicted left and right limb meshes are shown in red and blue, respectively, with ground truth highlighted in gray. view at source ↗
read the original abstract

Reconstructing the absolute 3D pose and shape of the hands from the user's viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth-scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and its position from the user's (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose, a unified arm-hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth-scale ambiguity, and a ray space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods and maintaining consistent performance across camera configurations. For more details, visit the project page at https://dfki-av.github.io/EgoForce.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces EgoForce, a monocular framework for absolute 3D hand pose and shape reconstruction from egocentric RGB images captured by head-mounted cameras. It proposes a differentiable forearm representation to stabilize pose, a unified arm-hand transformer to predict hand and forearm geometry jointly, and a ray-space closed-form solver to recover absolute 3D coordinates. The method claims to operate across fisheye, perspective, and distorted wide-FOV models using a single network without device-specific training data. Experiments on three egocentric benchmarks report state-of-the-art camera-space MPJPE, with up to 28% reduction on HOT3D and consistent cross-configuration performance.

Significance. If the cross-camera robustness and absolute recovery claims hold, the work would be significant for practical AR/VR and egocentric interaction systems by lowering the barrier of device-specific data collection. The forearm-guided stabilization and ray-space solver represent a concrete attempt to address depth-scale ambiguity in a unified manner, which could influence future monocular egocentric pipelines if the technical details are clarified.

major comments (2)
  1. [Abstract / Methods] Abstract and methods (ray-space solver description): The claim that the closed-form ray-space solver enables absolute 3D recovery across fisheye, perspective, and wide-FOV models without per-device training is load-bearing for the generalization result, yet the abstract provides no explicit mechanism for incorporating nonlinear distortion (e.g., equidistant or polynomial models) into the solver. This leaves open whether the solver assumes known intrinsics per model or relies on implicit learning that would require device-specific data, directly affecting the weakest assumption identified in the review.
  2. [Experiments] Experiments section: The reported up to 28% MPJPE reduction on HOT3D and consistent performance across camera configurations are central to the SOTA claim, but the abstract lacks ablations isolating the forearm representation and unified transformer contributions, as well as error bars or training data details. Without these, it is not possible to confirm that the gains are not due to post-hoc tuning or dataset-specific factors, undermining verification of the cross-configuration robustness.
minor comments (1)
  1. [Abstract] The project page link is provided but no supplementary material or code release is mentioned in the abstract; including a link to reproducible implementation would strengthen the submission.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments help clarify key aspects of our claims regarding generalization and experimental validation. We address each major comment point by point below with explanations and commitments to revisions where they improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and methods (ray-space solver description): The claim that the closed-form ray-space solver enables absolute 3D recovery across fisheye, perspective, and wide-FOV models without per-device training is load-bearing for the generalization result, yet the abstract provides no explicit mechanism for incorporating nonlinear distortion (e.g., equidistant or polynomial models) into the solver. This leaves open whether the solver assumes known intrinsics per model or relies on implicit learning that would require device-specific data, directly affecting the weakest assumption identified in the review.

    Authors: We appreciate the referee highlighting the need for explicit clarification on this central mechanism. The ray-space solver is a closed-form geometric method that takes the network's predictions (2D image-plane locations of hand joints and forearm parameters) and lifts them to absolute camera-space 3D coordinates by casting rays according to the camera's intrinsic model. This explicitly incorporates nonlinear distortion parameters (equidistant fisheye, polynomial, or perspective) using the known intrinsics provided at inference time; no implicit learning or device-specific retraining is involved. The network itself is trained once on mixed egocentric data and produces outputs in a normalized image space that is independent of the specific distortion (a pixel-to-ray sketch follows these responses). This is fully detailed in Section 3.3. To strengthen the abstract's presentation of the generalization claim, we will add a brief clause noting that the solver uses known camera intrinsics to handle diverse distortion models. revision: yes

  2. Referee: [Experiments] Experiments section: The reported up to 28% MPJPE reduction on HOT3D and consistent performance across camera configurations are central to the SOTA claim, but the abstract lacks ablations isolating the forearm representation and unified transformer contributions, as well as error bars or training data details. Without these, it is not possible to confirm that the gains are not due to post-hoc tuning or dataset-specific factors, undermining verification of the cross-configuration robustness.

    Authors: We agree that isolating component contributions and providing statistical details are essential for verifying the source of the reported gains. Ablations isolating the differentiable forearm representation and the unified arm-hand transformer are already presented in the experiments section (Section 4.4), with quantitative breakdowns showing their individual effects on camera-space accuracy. Error bars (standard deviation over three random seeds) are included in the main results tables and figures, and training data details—including dataset sizes, camera models, and splits—are described in Section 4.1. Note that space constraints preclude placing full ablations in the abstract; they belong in the experiments section. To further address the concern, we will add a compact training-data summary table and ensure error bars are explicitly referenced in the text discussing cross-configuration results. These changes will make it easier to confirm that the up to 28% MPJPE reduction on HOT3D and consistent performance arise from the proposed components rather than dataset-specific factors. revision: partial
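The referee's distortion question and the simulated authors' answer reduce to one concrete operation: convert each detected pixel into a viewing ray with the known intrinsic model before running the translation solve sketched earlier, so the solver itself never sees distortion. Below is a sketch for an ideal pinhole and an ideal equidistant fisheye; real devices add polynomial distortion terms, and none of this is claimed to be the paper's exact code.

```python
import numpy as np

def pinhole_ray(u, v, fx, fy, cx, cy):
    """Unit viewing ray for a distortion-free perspective camera."""
    d = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    return d / np.linalg.norm(d)

def equidistant_fisheye_ray(u, v, fx, fy, cx, cy):
    """Unit viewing ray for an ideal equidistant fisheye, where radial
    image distance equals f * theta (theta = angle off the optical axis)."""
    x, y = (u - cx) / fx, (v - cy) / fy
    theta = np.hypot(x, y)             # equidistant model: normalized radius = theta
    if theta < 1e-9:
        return np.array([0.0, 0.0, 1.0])
    s = np.sin(theta) / theta
    return np.array([x * s, y * s, np.cos(theta)])  # already unit length
```

On this reading, "handling distortion" means evaluating the right pixel-to-ray map with intrinsics known at inference time, exactly the condition the rebuttal states, with no per-device learning inside the solver.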

Circularity Check

0 steps flagged

No significant circularity in EgoForce derivation

full rationale

The paper introduces novel architectural components (differentiable forearm representation, unified arm-hand transformer, ray-space closed-form solver) to address depth-scale ambiguity and enable unified handling across camera models. These are presented as new mechanisms rather than reductions of outputs to fitted inputs or self-citations. SOTA claims rest on external benchmark experiments (HOT3D and others) that provide independent validation, with no equations or steps in the abstract reducing predictions to prior fits by construction. The framework is self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that forearm geometry provides sufficient additional constraint to resolve monocular depth ambiguity and that a single network can generalize across optical models without per-device retraining.

axioms (2)
  • domain assumption A differentiable forearm representation stabilizes hand pose estimation
    Invoked to justify the forearm component as a regularizer for depth-scale ambiguity.
  • domain assumption A unified arm-hand transformer can predict both hand and forearm geometry from a single egocentric view
    Core architectural assumption enabling the single-network design.

pith-pipeline@v0.9.0 · 5600 in / 1328 out tokens · 25394 ms · 2026-05-13T05:19:20.523758+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 1 internal anchor

  1. [1] Introducing HOT3D: An Egocentric Dataset for 3D Hand and Object Tracking. arXiv preprint arXiv:2406.09598, 2024.
  2. [2] Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J. Black, and Otmar Hilliges.
  3. [3] Taein Kwon, Bugra Tekin, Jan Stühmer, Federica Bogo, and Marc Pollefeys. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  4. [4] Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing Hands in 3D with Transformers.
  5. [5] WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild. arXiv preprint, 2024.
  6. [6] MobileHand: Real-time 3D Hand Shape and Pose Estimation from Color Image. In 27th International Conference on Neural Information Processing (ICONIP).
  7. [7] Kevin Lin, Lijuan Wang, and Zicheng Liu. In CVPR.
  8. [8] Gyeongsik Moon. In CVPR.
  9. [9] HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos. arXiv preprint arXiv:2501.02973.
  10. [10] Vibe: Video inference for human body pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  11. [11] Deformer: Dynamic fusion transformer for robust hand pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  12. [12] Beyond static features for temporally consistent 3D human pose and shape from a video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  13. [13] Embodied Hands: Modeling and Capturing Hands and Bodies Together. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 2017.
  14. [14] RTMDet: An Empirical Study of Designing Real-Time Object Detectors. arXiv preprint, 2022.
  15. [15] 3D hand pose estimation in everyday egocentric images. In European Conference on Computer Vision (ECCV), 2024.
  16. [16] End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  17. [17] Gyeongsik Moon and Kyoung Mu Lee. In European Conference on Computer Vision (ECCV).
  18. [18] Gyeongsik Moon, Juyong Chang, and Kyoung Mu Lee. In IEEE International Conference on Computer Vision (ICCV).
  19. [19] Towards accurate alignment in real-time 3D hand-mesh reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  20. [20] Franziska Mueller, Florian Bernard, Oleksandr Sotnychenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and Christian Theobalt. In Proceedings of Computer Vision and Pattern Recognition (CVPR).
  21. [21] Hand pose estimation via latent 2.5D heatmap regression. In Proceedings of the European Conference on Computer Vision (ECCV).
  22. [22] Yuxiao Zhou, Marc Habermann, Weipeng Xu, Ikhsanul Habibie, Christian Theobalt, and Feng Xu. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  23. [23] Neural voting field for camera-space 3D hand pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  24. [24] JoonKyu Park, Yeonguk Oh, Gyeongsik Moon, Hongsuk Choi, and Kyoung Mu Lee. In Conference on Computer Vision and Pattern Recognition (CVPR).
  25. [25] Eugene Valassakis and Guillermo Garcia-Hernando.
  26. [26] Camera-Space Hand Mesh Recovery via Semantic Aggregation and Adaptive 2D-1D Registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  27. [27] MobRecon: Mobile-Friendly Hand Mesh Reconstruction from Monocular Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  28. [28] Spectral Graphormer: Spectral graph-based transformer for egocentric two-hand reconstruction using multi-view color images. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  29. [29] Spatial-temporal parallel transformer for arm-hand dynamic estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  30. [30] Enhancing 3D hand pose estimation using SHaF: synthetic hand dataset including a forearm. Applied Intelligence, 2024.
  31. [31] Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. 2019.
  32. [32] Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russel, Max Argus, and Thomas Brox. In IEEE International Conference on Computer Vision (ICCV).
  33. [33] TouchInsight: Uncertainty-aware Rapid Touch and Text Input for Mixed Reality from Egocentric Vision. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology.
  34. [34] UmeTrack: Unified multi-view end-to-end hand tracking for VR. In SIGGRAPH Asia 2022 Conference Papers.
  35. [35] Controllers or bare hands? A controlled evaluation of input techniques on interaction performance and exertion in virtual reality. IEEE Transactions on Visualization and Computer Graphics, 2023.
  36. [36] NeuralPassthrough: Learned real-time view synthesis for VR. In ACM SIGGRAPH 2022 Conference Proceedings.
  37. [37] PinchCatcher: Enabling Multi-selection for Gaze+Pinch. arXiv preprint arXiv:2503.05456.
  38. [38] AtaTouch: Robust finger pinch detection for a VR controller using RF return loss. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems.
  39. [39] Uni-SLAM: Uncertainty-aware neural implicit SLAM for real-time dense indoor scene reconstruction. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.
  40. [40] 1988 Anthropometric Survey of U.S. Army Personnel: Summary Statistics. 1989.
  41. [41] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR).
  42. [42] Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems (NeurIPS).
  43. [43] ViTPose: Simple vision transformer baselines for human pose estimation. In Advances in Neural Information Processing Systems (NeurIPS).
  44. [44] Keypoint Transformer: Solving joint identification in challenging hands and object interactions for accurate 3D pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  45. [45] Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 1970.
  46. [46] Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In Proceedings of the IEEE International Conference on Computer Vision.
  47. [47] On the continuity of rotation representations in neural networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  48. [48] 3D pose regression using convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
  49. [49] Ego2Hands: A dataset for egocentric two-hand segmentation and detection. arXiv preprint arXiv:2011.07252.
  50. [50] HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos. arXiv preprint, 2025.
  51. [51] Single-to-dual-view adaptation for egocentric 3D hand pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  52. [52] PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS).
  53. [53] Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  54. [54] Project Aria: A New Tool for Egocentric Multi-Modal AI Research. arXiv preprint, 2023.
  55. [55] Microsoft Learn, 2021.
  56. [56] Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS).
  57. [57] BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
  58. [58] Linear pose estimation from points or lines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003.
  59. [59] Using many cameras as one. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003.
  60. [60] Deep Residual Learning for Image Recognition. arXiv preprint, 2015.
  61. [61] HOnnotate: A method for 3D Annotation of Hand and Object Poses. In Computer Vision and Pattern Recognition (CVPR).
  62. [62] Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. In Computer Vision and Pattern Recognition (CVPR).
  63. [63] PCIE_EgoHandPose Solution for EgoExo4D Hand Pose Challenge. arXiv preprint arXiv:2406.12219.
  64. [64] Learning 3D human dynamics from video. In Computer Vision and Pattern Recognition (CVPR).
  65. [65] Contrastive representation learning for hand shape estimation. In German Conference on Pattern Recognition (DAGM).
  66. [66] Depth Anything V2. In Advances in Neural Information Processing Systems (NeurIPS).
  67. [67] Predicting 4D hand trajectory from monocular videos. arXiv preprint arXiv:2501.08329.
  68. [68] MLPHand: Real-time multi-view 3D hand reconstruction via MLP modeling. In European Conference on Computer Vision (ECCV).
  69. [69] Dyn-HaMR: Recovering 4D interacting hand motion from a dynamic camera. In Computer Vision and Pattern Recognition (CVPR).
  70. [70] Nationwide stature estimation from forearm length measurements in Montenegrin adolescents. Int. J. Morphol.
  71. [71] Analysis of hand-forearm anthropometric components in assessing handgrip and pinch strengths of school-aged children and adolescents: a partial least squares (PLS) approach. BMC Pediatrics, 2021.
  72. [72] UPnP: An optimal O(n) solution to the absolute pose problem with universal applicability. In European Conference on Computer Vision (ECCV).
  73. [73] End-to-end learnable geometric vision by backpropagating PnP optimization. In Computer Vision and Pattern Recognition (CVPR).
  74. [74] AnyCalib: On-manifold learning for model-agnostic single-view camera calibration. In Computer Vision and Pattern Recognition (CVPR).
  75. [75] Unity Real-Time Development Platform.
  76. [76] 3D Pose Estimation of Two Interacting Hands from a Monocular Event Camera. In International Conference on 3D Vision (3DV).
  77. [77] Christen Millerdurai, Hiroyasu Akada, Jian Wang, Diogo Luvizon, Alain Pagani, Didier Stricker, Christian Theobalt, and Vladislav Golyanik. International Journal of Computer Vision (IJCV).
  78. [78] EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams. In Computer Vision and Pattern Recognition (CVPR).
  79. [79] The Computational Geometry Algorithms Library.
  80. [80] Menelaos Karavelas.

Showing first 80 references.