AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking
Pith reviewed 2026-05-22 09:02 UTC · model grok-4.3
The pith
AVI-HT fuses egocentric vision with on-glove IMU signals through cross-sensor attention to improve 3D hand tracking accuracy during object interactions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AVI-HT jointly models egocentric images with on-glove 6-DoF IMU signals and uses a cross-sensor deep attention mechanism to adaptively modulate trust in the vision stream and each IMU sensor. Trained on synchronized multi-modal data paired with motion-capture ground truth, the method reduces mean keypoint error by 16.1 percent overall and 24.2 percent in its wrist-aligned variant across baselines on the DexGloveHOI dataset of real-world hand-object interactions.
What carries the argument
A cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors.
If this is right
- Tracking remains accurate even when visual occlusion is heavy during object manipulation.
- The same attention-based fusion works under two different hand models, UmeTrack and MANO.
- IMU contributions vary by finger and activity type, as shown in per-finger ablation results.
- Performance degrades in predictable ways when IMU noise or vision-IMU timing errors increase.
Where Pith is reading between the lines
- The attention fusion pattern could transfer to other occluded body parts such as feet or full-body tracking with added IMUs.
- Real-time versions might support more natural hand interaction in AR or VR without requiring perfect lighting or camera views.
- Expanding the dataset to include more varied objects and environments would test whether the current error gains hold outside controlled collection conditions.
Load-bearing premise
The synchronized multi-modal training data pairing on-body vision-IMU streams with motion-capture ground truth is sufficiently representative of real-world hand-object interactions and allows the attention mechanism to modulate trust without introducing artifacts from noise or misalignment.
What would settle it
Running AVI-HT on a held-out test set with deliberately increased temporal misalignment between vision and IMU streams or added IMU noise, and finding that the reported error reductions disappear or reverse, would falsify the central claim.
Figures
read the original abstract
We present AVI-HT, an adaptive visual-IMU fusion approach for tracking 3D hand poses by jointly modeling the egocentric image with on-glove 6-DoF IMU signals. AVI-HT achieves significantly improved accuracy and availability, particularly in hand-object interaction (HOI) scenarios involving heavy visual occlusion. Two complementary ingredients underpin its success: (1) synchronized multi-modal training data pairing on-body vision-IMU sensor streams with ground-truth 3D hand poses from a motion-capture system, and (2) a cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors. To evaluate AVI-HT in real-world settings, we conduct extensive experiments on our DexGloveHOI dataset that consists of 100K+ pairwise vision-IMU samples with synchronized 3D annotated poses, in which users manipulate a variety of objects during daily tasks. We compare against multiple single- and multi-modal tracking approaches under two hand models (UmeTrack, MANO). The results show that AVI-HT reduces mean keypoint error by 16.1% and its wrist-aligned variant by 24.2% over the baselines. Ablation studies further reveal the per-finger contribution of IMU sensors across activity types, and the model's sensitivity to IMU noise and temporal misalignment in vision-IMU fusion.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AVI-HT, an adaptive vision-IMU fusion method for 3D hand pose tracking that combines egocentric images with on-glove 6-DoF IMU signals via a cross-sensor deep attention mechanism. It reports a new DexGloveHOI dataset of over 100K synchronized vision-IMU samples with mocap ground truth, and claims mean keypoint error reductions of 16.1% (24.2% for the wrist-aligned variant) over single- and multi-modal baselines under UmeTrack and MANO hand models, with ablations on per-finger IMU contributions and sensitivity to IMU noise/temporal misalignment.
Significance. If the central empirical claims hold under broader validation, the work would provide a practical multi-modal fusion strategy and a large-scale HOI dataset that could improve hand tracking robustness in occluded real-world settings; the attention-based trust modulation and per-finger ablation results are potentially useful for sensor-fusion literature.
major comments (2)
- The headline error reductions (16.1% and 24.2%) rest entirely on the authors' custom DexGloveHOI collection; to substantiate that the cross-sensor attention reliably modulates trust without injecting artifacts under realistic joint degradation, the manuscript should include explicit tests for simultaneous vision/IMU failure modes (e.g., motion blur combined with glove slip during grasping), as the existing ablations on individual IMU noise and temporal misalignment do not address these correlated cases.
- The assumption that the 100K+ synchronized mocap-paired samples are representative of diverse real-world hand-object interactions needs stronger support; without external validation sets or cross-dataset evaluation, hidden correlations between dataset construction and the fusion module could inflate the reported deltas.
minor comments (2)
- Clarify the exact architecture of the cross-sensor attention module (e.g., how per-IMU and vision features are embedded and how attention weights are computed) in the methods section to allow reproducibility.
- Specify the precise definitions of 'mean keypoint error' and 'wrist-aligned variant' when reporting quantitative results, including whether errors are averaged over all joints or only visible ones.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below and have incorporated revisions to strengthen the empirical validation and discussion of dataset representativeness.
read point-by-point responses
-
Referee: The headline error reductions (16.1% and 24.2%) rest entirely on the authors' custom DexGloveHOI collection; to substantiate that the cross-sensor attention reliably modulates trust without injecting artifacts under realistic joint degradation, the manuscript should include explicit tests for simultaneous vision/IMU failure modes (e.g., motion blur combined with glove slip during grasping), as the existing ablations on individual IMU noise and temporal misalignment do not address these correlated cases.
Authors: We agree that correlated failure modes represent an important robustness test not covered by the per-sensor ablations. In the revised manuscript we have added a new subsection with experiments that jointly degrade vision (via synthetic motion blur on egocentric frames) and IMU (via simulated glove slip offsets) during grasping sequences from DexGloveHOI. Results indicate the attention mechanism still yields 12–15% error reduction relative to baselines under these combined conditions, with attention weights shifting appropriately toward the less-degraded modality. We have updated the ablation table and added qualitative attention visualizations for these cases. revision: yes
-
Referee: The assumption that the 100K+ synchronized mocap-paired samples are representative of diverse real-world hand-object interactions needs stronger support; without external validation sets or cross-dataset evaluation, hidden correlations between dataset construction and the fusion module could inflate the reported deltas.
Authors: We acknowledge the value of external validation. The revised manuscript now includes an expanded dataset description section detailing the diversity of 12 object categories, 8 daily activity types, and 15 participants, along with quantitative coverage statistics. We also added a per-activity breakdown showing consistent gains across interaction types. Direct cross-dataset evaluation is currently limited by the absence of other large-scale synchronized vision-IMU HOI datasets with mocap ground truth; we have therefore added a limitations paragraph and commit to releasing the full DexGloveHOI dataset to enable such studies. revision: partial
Circularity Check
No circularity: empirical results on new dataset with independent comparisons
full rationale
The paper introduces AVI-HT as an engineering method combining vision and IMU via cross-sensor attention, evaluated via direct error metrics on the authors' newly collected DexGloveHOI dataset (100K+ synchronized samples). Reported gains (16.1% mean keypoint error reduction, 24.2% wrist-aligned) are explicit comparisons to baselines under UmeTrack and MANO models, accompanied by ablations on per-finger IMU contribution and sensitivity to noise/misalignment. No derivation chain, equations, or first-principles claims exist that reduce by construction to fitted inputs, self-definitions, or self-citation load-bearing premises. The central results rest on external mocap ground truth and standard benchmark-style testing rather than internal renaming or forced predictions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinctionreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
a cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors... hierarchical cross-sensor attention module... kinematic prior mask M... Attention[1](Q,K,V) = softmax(QK⊤/√dk + M)V
-
IndisputableMonolith/Cost/FunctionalEquationwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
IMU encoder... transformer encoder with multi-head self-attention... fused representation is decoded by a regression head
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Dexterous imitation made easy: A learning-based framework for efficient dexterous manipulation
Sridhar Pandian Arunachalam, Sneha Silwal, Ben Evans, and Lerrel Pinto. Dexterous imitation made easy: A learning-based framework for efficient dexterous manipulation. In2023 ieee international conference on robotics and automation (icra), pages 5954–5961. IEEE, 2023
work page 2023
-
[2]
Multimae: Multi-modal multi-task masked autoencoders
Roman Bachmann, David Mizrahi, Andrei Atanov, and Amir Zamir. Multimae: Multi-modal multi-task masked autoencoders. InEuropean conference on computer vision, pages 348–367. Springer, 2022
work page 2022
-
[3]
Hot3d: Hand and object tracking in 3d from egocentric multi-view videos
Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi-view videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7071, 2025
work page 2025
-
[4]
Yiming Bao, Xu Zhao, and Dahong Qian. Fusepose: Imu-vision sensor fusion in kinematic space for parametric human pose estimation.IEEE Transactions on Multimedia, 25:7736–7746, 2022
work page 2022
-
[5]
3d hand shape and pose from images in the wild
Adnane Boukhayma, Rodrigo de Bem, and Philip HS Torr. 3d hand shape and pose from images in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10843–10852, 2019
work page 2019
-
[6]
Hand avatar: Free-pose hand animation and rendering from monocular video
Xingyu Chen, Baoyuan Wang, and Heung-Yeung Shum. Hand avatar: Free-pose hand animation and rendering from monocular video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8683–8693, 2023
work page 2023
-
[7]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[8]
Imagebind: One embedding space to bind them all
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15180–15190, 2023
work page 2023
-
[9]
Shaoxiang Guo, Eric Rigall, Yakun Ju, and Junyu Dong. 3d hand pose estimation from monocular rgb with feature interaction module.IEEE Transactions on Circuits and Systems for Video Technology, 32(8):5293–5306, 2022
work page 2022
-
[10]
Honnotate: A method for 3d annotation of hand and object poses
Shreyas Hampali, Mahdi Rad, Markus Oberweger, and Vincent Lepetit. Honnotate: A method for 3d annotation of hand and object poses. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3196–3206, 2020
work page 2020
-
[11]
Shangchen Han, Beibei Liu, Randi Cabezas, Christopher D Twigg, Peizhao Zhang, Jeff Petkau, Tsz-Ho Yu, Chun-Jung Tai, Muzaffer Akbay, Zheng Wang, et al. Megatrack: monochrome egocentric articulated hand-tracking for virtual reality.ACM Transactions on Graphics (ToG), 39(4):87–1, 2020
work page 2020
-
[12]
Umetrack: Unified multi- view end-to-end hand tracking for vr
Shangchen Han, Po-chen Wu, Yubo Zhang, Beibei Liu, Linguang Zhang, Zheng Wang, Weiguang Si, Peizhao Zhang, Yujun Cai, Tomas Hodan, et al. Umetrack: Unified multi- view end-to-end hand tracking for vr. InSIGGRAPH Asia 2022 conference papers, pages 1–9, 2022
work page 2022
-
[13]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
work page 2016
-
[14]
Yinghao Huang, Manuel Kaufmann, Emre Aksan, Michael J Black, Otmar Hilliges, and Gerard Pons-Moll. Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time.ACM Transactions on Graphics (TOG), 37(6):1–15, 2018. 10
work page 2018
-
[15]
Daniel Laidig and Thomas Seel. Vqf: Highly accurate imu orientation estimation with bias estimation and magnetic disturbance rejection.Information Fusion, 91:187–204, 2023
work page 2023
-
[16]
Yu Lei, Yi Deng, Lin Dong, Xiaohui Li, Xiangnan Li, and Zhi Su. A novel sensor fusion approach for precise hand tracking in virtual reality-based human—computer interaction. Biomimetics, 8(3):326, 2023
work page 2023
-
[17]
Pose space deformation: a unified approach to shape interpolation and skeleton-driven deformation
John P Lewis, Matt Cordner, and Nickson Fong. Pose space deformation: a unified approach to shape interpolation and skeleton-driven deformation. InSeminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 811–818. 2023
work page 2023
-
[18]
Ego-body pose estimation via ego-head pose estimation
Jiaman Li, Karen Liu, and Jiajun Wu. Ego-body pose estimation via ego-head pose estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17142–17151, 2023
work page 2023
-
[19]
Jiawen Li, Fei Jiang, Dandan Zhu, and Aimin Zhou. Handnet: Occlusion-robust 3d hand mesh reconstruction with prior information.Available at SSRN 5244153, 2025
work page 2025
-
[20]
Two-hand global 3d pose estimation using monocular rgb
Fanqing Lin, Connor Wilhelm, and Tony Martinez. Two-hand global 3d pose estimation using monocular rgb. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2373–2381, 2021
work page 2021
-
[21]
End-to-end human pose and mesh reconstruction with transformers
Kevin Lin, Lijuan Wang, and Zicheng Liu. End-to-end human pose and mesh reconstruction with transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1954–1963, 2021
work page 1954
-
[22]
Kevin Lin, Lijuan Wang, and Zicheng Liu. Mesh graphormer. InProceedings of the IEEE/CVF international conference on computer vision, pages 12939–12948, 2021
work page 2021
-
[23]
Manolis LA Lourakis and Antonis A Argyros. Is levenberg-marquardt the most efficient optimization algorithm for implementing bundle adjustment? InTenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, volume 2, pages 1526–1531. IEEE, 2005
work page 2005
-
[24]
Andualem T Maereg, Emanuele L Secco, Tayachew F Agidew, David Reid, and Atulya K Nagar. A low-cost, wearable opto-inertial 6-dof hand pose tracking system for vr.Technologies, 5(3):49, 2017
work page 2017
-
[25]
Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. Interhand2. 6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. In European Conference on Computer Vision, pages 548–564. Springer, 2020
work page 2020
-
[26]
Real-time hand tracking under occlusion from an egocentric rgb-d sensor
Franziska Mueller, Dushyant Mehta, Oleksandr Sotnychenko, Srinath Sridhar, Dan Casas, and Christian Theobalt. Real-time hand tracking under occlusion from an egocentric rgb-d sensor. InProceedings of the IEEE international conference on computer vision, pages 1154–1163, 2017
work page 2017
-
[27]
Real-time and embedded detection of hand gestures with an imu-based glove
Chaithanya Kumar Mummadi, Frederic Philips Peter Leo, Keshav Deep Verma, Shivaji Kasireddy, Philipp M Scholl, Jochen Kempfle, and Kristof Van Laerhoven. Real-time and embedded detection of hand gestures with an imu-based glove. InInformatics, volume 5, page 28. MDPI, 2018
work page 2018
-
[28]
Takehiko Ohkawa, Ryosuke Furuta, and Yoichi Sato. Efficient annotation and learning for 3d hand pose estimation: A survey.International Journal of Computer Vision, 131(12):3193–3206, 2023
work page 2023
-
[29]
Fusing monocular images and sparse imu signals for real-time human motion capture
Shaohua Pan, Qi Ma, Xinyu Yi, Weifeng Hu, Xiong Wang, Xingkang Zhou, Jijunnan Li, and Feng Xu. Fusing monocular images and sparse imu signals for real-time human motion capture. InSIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023
work page 2023
-
[30]
Reconstructing hands in 3d with transformers
Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3d with transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2024. 11
work page 2024
-
[31]
Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, and Dieter Fox. Anyteleop: A general vision-based dexterous robot arm-hand teleoperation system.arXiv preprint arXiv:2307.04577, 2023
-
[32]
Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6), November 2017
work page 2017
-
[33]
Frankmocap: A monocular 3d whole-body pose estimation system via regression and integration
Yu Rong, Takaaki Shiratori, and Hanbyul Joo. Frankmocap: A monocular 3d whole-body pose estimation system via regression and integration. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1749–1759, 2021
work page 2021
-
[34]
Real-time hand pose tracking using 6-axis imus
Anik Sarker, Ziyi Kou, Ergys Ristani, Li Guan, and Taylor Niehues. Real-time hand pose tracking using 6-axis imus. InProceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction, pages 1182–1191, 2026
work page 2026
-
[35]
Attention is all you need.Advances in neural information processing systems, 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017
work page 2017
-
[36]
Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation.Advances in neural information processing systems, 35:38571–38584, 2022
work page 2022
-
[37]
Xinyu Yi, Yuxiao Zhou, and Feng Xu. Transpose: Real-time 3d human translation and pose estimation with six inertial sensors.ACM Transactions On Graphics (TOG), 40(4):1–13, 2021
work page 2021
-
[38]
MediaPipe Hands: On-device Real-time Hand Tracking,
Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, and Matthias Grundmann. Mediapipe hands: On-device real-time hand tracking.arXiv preprint arXiv:2006.10214, 2020
-
[39]
Haoran Zhang, Shuanghao Bai, Wanqi Zhou, Yuedi Zhang, Qi Zhang, Pengxiang Ding, Cheng Chi, Donglin Wang, and Badong Chen. Vcot-grasp: Grasp foundation models with visual chain- of-thought reasoning for language-driven grasp generation.arXiv preprint arXiv:2510.05827, 2025
-
[40]
Xinyu Zhang, Ziyi Kou, Chuan Qin, Mia Huang, Ergys Ristani, Ankit Kumar, Lele Chen, Kun He, Abdeslam Boularias, and Li Guan. Glove2hand: Synthesizing natural hand-object interaction from multi-modal sensing gloves.arXiv preprint arXiv:2603.20850, 2026
-
[41]
Fusing wearable imus with multi- view images for human pose estimation: A geometric approach
Zhe Zhang, Chunyu Wang, Wenhu Qin, and Wenjun Zeng. Fusing wearable imus with multi- view images for human pose estimation: A geometric approach. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2200–2209, 2020
work page 2020
-
[42]
Learning to estimate 3d hand pose from single rgb images
Christian Zimmermann and Thomas Brox. Learning to estimate 3d hand pose from single rgb images. InProceedings of the IEEE international conference on computer vision, pages 4903–4911, 2017
work page 2017
-
[43]
Freihand: A dataset for markerless capture of hand pose and shape from single rgb images
Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max Argus, and Thomas Brox. Freihand: A dataset for markerless capture of hand pose and shape from single rgb images. InProceedings of the IEEE/CVF international conference on computer vision, pages 813–822, 2019. 12 A Marker Based Mocap System Setup To obtain ground-truth 3D hand pose annotat...
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.