pith. sign in

arxiv: 2605.21714 · v1 · pith:CV4QEXQAnew · submitted 2026-05-20 · 💻 cs.CV · cs.RO

AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking

Pith reviewed 2026-05-22 09:02 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords 3D hand trackingvision-IMU fusioncross-sensor attentionhand-object interactionegocentric visionIMU sensorspose estimationmulti-modal tracking
0
0 comments X

The pith

AVI-HT fuses egocentric vision with on-glove IMU signals through cross-sensor attention to improve 3D hand tracking accuracy during object interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an adaptive fusion method called AVI-HT that combines egocentric camera images with 6-DoF IMU readings from a glove to estimate 3D hand poses. This approach targets scenarios where hands are heavily occluded by objects, a common issue in daily tasks. It depends on a new dataset of over 100,000 synchronized vision-IMU samples paired with motion-capture ground truth and uses a deep attention network to adjust how much each sensor is trusted at each moment. Experiments on this dataset with two hand models demonstrate lower average keypoint errors than single-modal or other multi-modal baselines. Ablations also examine how individual IMU sensors contribute to different fingers and activities.

Core claim

AVI-HT jointly models egocentric images with on-glove 6-DoF IMU signals and uses a cross-sensor deep attention mechanism to adaptively modulate trust in the vision stream and each IMU sensor. Trained on synchronized multi-modal data paired with motion-capture ground truth, the method reduces mean keypoint error by 16.1 percent overall and 24.2 percent in its wrist-aligned variant across baselines on the DexGloveHOI dataset of real-world hand-object interactions.

What carries the argument

A cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors.

If this is right

  • Tracking remains accurate even when visual occlusion is heavy during object manipulation.
  • The same attention-based fusion works under two different hand models, UmeTrack and MANO.
  • IMU contributions vary by finger and activity type, as shown in per-finger ablation results.
  • Performance degrades in predictable ways when IMU noise or vision-IMU timing errors increase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The attention fusion pattern could transfer to other occluded body parts such as feet or full-body tracking with added IMUs.
  • Real-time versions might support more natural hand interaction in AR or VR without requiring perfect lighting or camera views.
  • Expanding the dataset to include more varied objects and environments would test whether the current error gains hold outside controlled collection conditions.

Load-bearing premise

The synchronized multi-modal training data pairing on-body vision-IMU streams with motion-capture ground truth is sufficiently representative of real-world hand-object interactions and allows the attention mechanism to modulate trust without introducing artifacts from noise or misalignment.

What would settle it

Running AVI-HT on a held-out test set with deliberately increased temporal misalignment between vision and IMU streams or added IMU noise, and finding that the reported error reductions disappear or reverse, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.21714 by Ankit Kumar, Ergys Ristani, Li Guan, Mia Huang, Taylor Niehues, Vatsal Mehta, Ziyi Kou.

Figure 1
Figure 1. Figure 1: 3D hand tracking from egocentric vision and on-glove IMUs. (Left) Our multi-modal data capture setup that consists of an egocentric camera and a data sensing glove with 12 6-DoF IMU sensors. (Right) Qualitative comparison on 4 egocentric frames. For each frame, the top row shows an inset hand diagram. The bottom row shows the corresponding 3D hand pose tracking overlaid for UMETrack and AVI-HT against grou… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of AVI-HT. A Vision Encoder converts the egocentric image into a global visual token Fvis. An IMU Encoder embeds a 14 timestamp temporal window from 12 on-glove sensors into per-sensor tokens {sk} via a transformer encoder and signal head. The two token sets are fused through a hierarchical cross-sensor attention module. A kinematic prior mask aggregates attended features based on physical sensor-… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of DexGloveHOI dataset. (Left) Sample egocentric images showing diverse activities captured with the data sensing glove. (Top right) Distribution of activity categories in Dex￾GloveHOI. (Bottom right) Key statistics of dataset. Each frame is paired with synchronized on-glove IMU signals and ground truth from MoCap system in both MANO and UMETrack representations. We design the capture protocol to … view at source ↗
Figure 4
Figure 4. Figure 4: Breakdown evaluation for AVI-HT. (Top) Absolute angle error across all 22 degrees of freedom for UMETrack, UMETrack + EKF, and AVI-HT. AVI-HT reduces error on most joints, with the largest gains on MCP flexion joints. (Bottom) Cumulative distribution of joint angle errors over all DexGloveHOI samples. AVI-HT achieves a higher fraction of samples below each error threshold. Ground-Truth AVI-HT (CV+IMU) UMET… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative tracking comparison. Two hand-object interaction sequences with finger occlusion moments. AVI-HT closely tracks the ground truth under heavy occlusion, while UMETrack deviates at unseen fingers from egocentric views. off-diagonal cells also show notable negative values, particularly the ones closer to the diagonal. We attribute this to the kinematic coupling of the hand. That is, neighboring fi… view at source ↗
Figure 6
Figure 6. Figure 6: Per-finger IMU contribution. (Left) MKPE.T Gap when only the corresponding IMU sensor group (row) is provided during training. (Right) MKPE.T Gap broken down by activity type [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Sensitivity to IMU signal quality. (Left) MKPE and MKPE.T under increasing additive [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Marker Based Mocap System Setup B More Visualization for AVI-HT We provide additional qualitative comparisons between AVI-HT and UMETrack in [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: More qualitative comparison between AVI-HT and UMETrack. [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
read the original abstract

We present AVI-HT, an adaptive visual-IMU fusion approach for tracking 3D hand poses by jointly modeling the egocentric image with on-glove 6-DoF IMU signals. AVI-HT achieves significantly improved accuracy and availability, particularly in hand-object interaction (HOI) scenarios involving heavy visual occlusion. Two complementary ingredients underpin its success: (1) synchronized multi-modal training data pairing on-body vision-IMU sensor streams with ground-truth 3D hand poses from a motion-capture system, and (2) a cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors. To evaluate AVI-HT in real-world settings, we conduct extensive experiments on our DexGloveHOI dataset that consists of 100K+ pairwise vision-IMU samples with synchronized 3D annotated poses, in which users manipulate a variety of objects during daily tasks. We compare against multiple single- and multi-modal tracking approaches under two hand models (UmeTrack, MANO). The results show that AVI-HT reduces mean keypoint error by 16.1% and its wrist-aligned variant by 24.2% over the baselines. Ablation studies further reveal the per-finger contribution of IMU sensors across activity types, and the model's sensitivity to IMU noise and temporal misalignment in vision-IMU fusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AVI-HT, an adaptive vision-IMU fusion method for 3D hand pose tracking that combines egocentric images with on-glove 6-DoF IMU signals via a cross-sensor deep attention mechanism. It reports a new DexGloveHOI dataset of over 100K synchronized vision-IMU samples with mocap ground truth, and claims mean keypoint error reductions of 16.1% (24.2% for the wrist-aligned variant) over single- and multi-modal baselines under UmeTrack and MANO hand models, with ablations on per-finger IMU contributions and sensitivity to IMU noise/temporal misalignment.

Significance. If the central empirical claims hold under broader validation, the work would provide a practical multi-modal fusion strategy and a large-scale HOI dataset that could improve hand tracking robustness in occluded real-world settings; the attention-based trust modulation and per-finger ablation results are potentially useful for sensor-fusion literature.

major comments (2)
  1. The headline error reductions (16.1% and 24.2%) rest entirely on the authors' custom DexGloveHOI collection; to substantiate that the cross-sensor attention reliably modulates trust without injecting artifacts under realistic joint degradation, the manuscript should include explicit tests for simultaneous vision/IMU failure modes (e.g., motion blur combined with glove slip during grasping), as the existing ablations on individual IMU noise and temporal misalignment do not address these correlated cases.
  2. The assumption that the 100K+ synchronized mocap-paired samples are representative of diverse real-world hand-object interactions needs stronger support; without external validation sets or cross-dataset evaluation, hidden correlations between dataset construction and the fusion module could inflate the reported deltas.
minor comments (2)
  1. Clarify the exact architecture of the cross-sensor attention module (e.g., how per-IMU and vision features are embedded and how attention weights are computed) in the methods section to allow reproducibility.
  2. Specify the precise definitions of 'mean keypoint error' and 'wrist-aligned variant' when reporting quantitative results, including whether errors are averaged over all joints or only visible ones.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and have incorporated revisions to strengthen the empirical validation and discussion of dataset representativeness.

read point-by-point responses
  1. Referee: The headline error reductions (16.1% and 24.2%) rest entirely on the authors' custom DexGloveHOI collection; to substantiate that the cross-sensor attention reliably modulates trust without injecting artifacts under realistic joint degradation, the manuscript should include explicit tests for simultaneous vision/IMU failure modes (e.g., motion blur combined with glove slip during grasping), as the existing ablations on individual IMU noise and temporal misalignment do not address these correlated cases.

    Authors: We agree that correlated failure modes represent an important robustness test not covered by the per-sensor ablations. In the revised manuscript we have added a new subsection with experiments that jointly degrade vision (via synthetic motion blur on egocentric frames) and IMU (via simulated glove slip offsets) during grasping sequences from DexGloveHOI. Results indicate the attention mechanism still yields 12–15% error reduction relative to baselines under these combined conditions, with attention weights shifting appropriately toward the less-degraded modality. We have updated the ablation table and added qualitative attention visualizations for these cases. revision: yes

  2. Referee: The assumption that the 100K+ synchronized mocap-paired samples are representative of diverse real-world hand-object interactions needs stronger support; without external validation sets or cross-dataset evaluation, hidden correlations between dataset construction and the fusion module could inflate the reported deltas.

    Authors: We acknowledge the value of external validation. The revised manuscript now includes an expanded dataset description section detailing the diversity of 12 object categories, 8 daily activity types, and 15 participants, along with quantitative coverage statistics. We also added a per-activity breakdown showing consistent gains across interaction types. Direct cross-dataset evaluation is currently limited by the absence of other large-scale synchronized vision-IMU HOI datasets with mocap ground truth; we have therefore added a limitations paragraph and commit to releasing the full DexGloveHOI dataset to enable such studies. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical results on new dataset with independent comparisons

full rationale

The paper introduces AVI-HT as an engineering method combining vision and IMU via cross-sensor attention, evaluated via direct error metrics on the authors' newly collected DexGloveHOI dataset (100K+ synchronized samples). Reported gains (16.1% mean keypoint error reduction, 24.2% wrist-aligned) are explicit comparisons to baselines under UmeTrack and MANO models, accompanied by ablations on per-finger IMU contribution and sensitivity to noise/misalignment. No derivation chain, equations, or first-principles claims exist that reduce by construction to fitted inputs, self-definitions, or self-citation load-bearing premises. The central results rest on external mocap ground truth and standard benchmark-style testing rather than internal renaming or forced predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is limited to the abstract; no explicit free parameters, axioms, or invented entities are detailed beyond the implicit reliance on synchronized mocap-paired training data as a domain assumption for supervised learning.

pith-pipeline@v0.9.0 · 5788 in / 1150 out tokens · 30252 ms · 2026-05-22T09:02:11.784263+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

  1. [1]

    Dexterous imitation made easy: A learning-based framework for efficient dexterous manipulation

    Sridhar Pandian Arunachalam, Sneha Silwal, Ben Evans, and Lerrel Pinto. Dexterous imitation made easy: A learning-based framework for efficient dexterous manipulation. In2023 ieee international conference on robotics and automation (icra), pages 5954–5961. IEEE, 2023

  2. [2]

    Multimae: Multi-modal multi-task masked autoencoders

    Roman Bachmann, David Mizrahi, Andrei Atanov, and Amir Zamir. Multimae: Multi-modal multi-task masked autoencoders. InEuropean conference on computer vision, pages 348–367. Springer, 2022

  3. [3]

    Hot3d: Hand and object tracking in 3d from egocentric multi-view videos

    Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi-view videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7071, 2025

  4. [4]

    Fusepose: Imu-vision sensor fusion in kinematic space for parametric human pose estimation.IEEE Transactions on Multimedia, 25:7736–7746, 2022

    Yiming Bao, Xu Zhao, and Dahong Qian. Fusepose: Imu-vision sensor fusion in kinematic space for parametric human pose estimation.IEEE Transactions on Multimedia, 25:7736–7746, 2022

  5. [5]

    3d hand shape and pose from images in the wild

    Adnane Boukhayma, Rodrigo de Bem, and Philip HS Torr. 3d hand shape and pose from images in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10843–10852, 2019

  6. [6]

    Hand avatar: Free-pose hand animation and rendering from monocular video

    Xingyu Chen, Baoyuan Wang, and Heung-Yeung Shum. Hand avatar: Free-pose hand animation and rendering from monocular video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8683–8693, 2023

  7. [7]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

  8. [8]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15180–15190, 2023

  9. [9]

    3d hand pose estimation from monocular rgb with feature interaction module.IEEE Transactions on Circuits and Systems for Video Technology, 32(8):5293–5306, 2022

    Shaoxiang Guo, Eric Rigall, Yakun Ju, and Junyu Dong. 3d hand pose estimation from monocular rgb with feature interaction module.IEEE Transactions on Circuits and Systems for Video Technology, 32(8):5293–5306, 2022

  10. [10]

    Honnotate: A method for 3d annotation of hand and object poses

    Shreyas Hampali, Mahdi Rad, Markus Oberweger, and Vincent Lepetit. Honnotate: A method for 3d annotation of hand and object poses. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3196–3206, 2020

  11. [11]

    Megatrack: monochrome egocentric articulated hand-tracking for virtual reality.ACM Transactions on Graphics (ToG), 39(4):87–1, 2020

    Shangchen Han, Beibei Liu, Randi Cabezas, Christopher D Twigg, Peizhao Zhang, Jeff Petkau, Tsz-Ho Yu, Chun-Jung Tai, Muzaffer Akbay, Zheng Wang, et al. Megatrack: monochrome egocentric articulated hand-tracking for virtual reality.ACM Transactions on Graphics (ToG), 39(4):87–1, 2020

  12. [12]

    Umetrack: Unified multi- view end-to-end hand tracking for vr

    Shangchen Han, Po-chen Wu, Yubo Zhang, Beibei Liu, Linguang Zhang, Zheng Wang, Weiguang Si, Peizhao Zhang, Yujun Cai, Tomas Hodan, et al. Umetrack: Unified multi- view end-to-end hand tracking for vr. InSIGGRAPH Asia 2022 conference papers, pages 1–9, 2022

  13. [13]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  14. [14]

    Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time.ACM Transactions on Graphics (TOG), 37(6):1–15, 2018

    Yinghao Huang, Manuel Kaufmann, Emre Aksan, Michael J Black, Otmar Hilliges, and Gerard Pons-Moll. Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time.ACM Transactions on Graphics (TOG), 37(6):1–15, 2018. 10

  15. [15]

    Vqf: Highly accurate imu orientation estimation with bias estimation and magnetic disturbance rejection.Information Fusion, 91:187–204, 2023

    Daniel Laidig and Thomas Seel. Vqf: Highly accurate imu orientation estimation with bias estimation and magnetic disturbance rejection.Information Fusion, 91:187–204, 2023

  16. [16]

    A novel sensor fusion approach for precise hand tracking in virtual reality-based human—computer interaction

    Yu Lei, Yi Deng, Lin Dong, Xiaohui Li, Xiangnan Li, and Zhi Su. A novel sensor fusion approach for precise hand tracking in virtual reality-based human—computer interaction. Biomimetics, 8(3):326, 2023

  17. [17]

    Pose space deformation: a unified approach to shape interpolation and skeleton-driven deformation

    John P Lewis, Matt Cordner, and Nickson Fong. Pose space deformation: a unified approach to shape interpolation and skeleton-driven deformation. InSeminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 811–818. 2023

  18. [18]

    Ego-body pose estimation via ego-head pose estimation

    Jiaman Li, Karen Liu, and Jiajun Wu. Ego-body pose estimation via ego-head pose estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17142–17151, 2023

  19. [19]

    Handnet: Occlusion-robust 3d hand mesh reconstruction with prior information.Available at SSRN 5244153, 2025

    Jiawen Li, Fei Jiang, Dandan Zhu, and Aimin Zhou. Handnet: Occlusion-robust 3d hand mesh reconstruction with prior information.Available at SSRN 5244153, 2025

  20. [20]

    Two-hand global 3d pose estimation using monocular rgb

    Fanqing Lin, Connor Wilhelm, and Tony Martinez. Two-hand global 3d pose estimation using monocular rgb. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2373–2381, 2021

  21. [21]

    End-to-end human pose and mesh reconstruction with transformers

    Kevin Lin, Lijuan Wang, and Zicheng Liu. End-to-end human pose and mesh reconstruction with transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1954–1963, 2021

  22. [22]

    Mesh graphormer

    Kevin Lin, Lijuan Wang, and Zicheng Liu. Mesh graphormer. InProceedings of the IEEE/CVF international conference on computer vision, pages 12939–12948, 2021

  23. [23]

    Manolis LA Lourakis and Antonis A Argyros. Is levenberg-marquardt the most efficient optimization algorithm for implementing bundle adjustment? InTenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, volume 2, pages 1526–1531. IEEE, 2005

  24. [24]

    A low-cost, wearable opto-inertial 6-dof hand pose tracking system for vr.Technologies, 5(3):49, 2017

    Andualem T Maereg, Emanuele L Secco, Tayachew F Agidew, David Reid, and Atulya K Nagar. A low-cost, wearable opto-inertial 6-dof hand pose tracking system for vr.Technologies, 5(3):49, 2017

  25. [25]

    Interhand2

    Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. Interhand2. 6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. In European Conference on Computer Vision, pages 548–564. Springer, 2020

  26. [26]

    Real-time hand tracking under occlusion from an egocentric rgb-d sensor

    Franziska Mueller, Dushyant Mehta, Oleksandr Sotnychenko, Srinath Sridhar, Dan Casas, and Christian Theobalt. Real-time hand tracking under occlusion from an egocentric rgb-d sensor. InProceedings of the IEEE international conference on computer vision, pages 1154–1163, 2017

  27. [27]

    Real-time and embedded detection of hand gestures with an imu-based glove

    Chaithanya Kumar Mummadi, Frederic Philips Peter Leo, Keshav Deep Verma, Shivaji Kasireddy, Philipp M Scholl, Jochen Kempfle, and Kristof Van Laerhoven. Real-time and embedded detection of hand gestures with an imu-based glove. InInformatics, volume 5, page 28. MDPI, 2018

  28. [28]

    Efficient annotation and learning for 3d hand pose estimation: A survey.International Journal of Computer Vision, 131(12):3193–3206, 2023

    Takehiko Ohkawa, Ryosuke Furuta, and Yoichi Sato. Efficient annotation and learning for 3d hand pose estimation: A survey.International Journal of Computer Vision, 131(12):3193–3206, 2023

  29. [29]

    Fusing monocular images and sparse imu signals for real-time human motion capture

    Shaohua Pan, Qi Ma, Xinyu Yi, Weifeng Hu, Xiong Wang, Xingkang Zhou, Jijunnan Li, and Feng Xu. Fusing monocular images and sparse imu signals for real-time human motion capture. InSIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023

  30. [30]

    Reconstructing hands in 3d with transformers

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3d with transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2024. 11

  31. [31]

    Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system.arXiv preprint arXiv:2307.04577, 2023

    Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dieter Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system.arXiv preprint arXiv:2307.04577, 2023

  32. [32]

    Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6), November 2017

  33. [33]

    Frankmocap: A monocular 3d whole-body pose estimation system via regression and integration

    Yu Rong, Takaaki Shiratori, and Hanbyul Joo. Frankmocap: A monocular 3d whole-body pose estimation system via regression and integration. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1749–1759, 2021

  34. [34]

    Real-time hand pose tracking using 6-axis imus

    Anik Sarker, Ziyi Kou, Ergys Ristani, Li Guan, and Taylor Niehues. Real-time hand pose tracking using 6-axis imus. InProceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction, pages 1182–1191, 2026

  35. [35]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  36. [36]

    Vitpose: Simple vision transformer baselines for human pose estimation.Advances in neural information processing systems, 35:38571–38584, 2022

    Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation.Advances in neural information processing systems, 35:38571–38584, 2022

  37. [37]

    Transpose: Real-time 3d human translation and pose estimation with six inertial sensors.ACM Transactions On Graphics (TOG), 40(4):1–13, 2021

    Xinyu Yi, Yuxiao Zhou, and Feng Xu. Transpose: Real-time 3d human translation and pose estimation with six inertial sensors.ACM Transactions On Graphics (TOG), 40(4):1–13, 2021

  38. [38]

    MediaPipe Hands: On-device Real-time Hand Tracking,

    Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, and Matthias Grundmann. Mediapipe hands: On-device real-time hand tracking.arXiv preprint arXiv:2006.10214, 2020

  39. [39]

    Vcot-grasp: Grasp foundation models with visual chain- of-thought reasoning for language-driven grasp generation.arXiv preprint arXiv:2510.05827, 2025

    Haoran Zhang, Shuanghao Bai, Wanqi Zhou, Yuedi Zhang, Qi Zhang, Pengxiang Ding, Cheng Chi, Donglin Wang, and Badong Chen. Vcot-grasp: Grasp foundation models with visual chain- of-thought reasoning for language-driven grasp generation.arXiv preprint arXiv:2510.05827, 2025

  40. [40]

    Glove2hand: Synthesizing natural hand-object interaction from multi-modal sensing gloves.arXiv preprint arXiv:2603.20850, 2026

    Xinyu Zhang, Ziyi Kou, Chuan Qin, Mia Huang, Ergys Ristani, Ankit Kumar, Lele Chen, Kun He, Abdeslam Boularias, and Li Guan. Glove2hand: Synthesizing natural hand-object interaction from multi-modal sensing gloves.arXiv preprint arXiv:2603.20850, 2026

  41. [41]

    Fusing wearable imus with multi- view images for human pose estimation: A geometric approach

    Zhe Zhang, Chunyu Wang, Wenhu Qin, and Wenjun Zeng. Fusing wearable imus with multi- view images for human pose estimation: A geometric approach. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2200–2209, 2020

  42. [42]

    Learning to estimate 3d hand pose from single rgb images

    Christian Zimmermann and Thomas Brox. Learning to estimate 3d hand pose from single rgb images. InProceedings of the IEEE international conference on computer vision, pages 4903–4911, 2017

  43. [43]

    Freihand: A dataset for markerless capture of hand pose and shape from single rgb images

    Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max Argus, and Thomas Brox. Freihand: A dataset for markerless capture of hand pose and shape from single rgb images. InProceedings of the IEEE/CVF international conference on computer vision, pages 813–822, 2019. 12 A Marker Based Mocap System Setup To obtain ground-truth 3D hand pose annotat...