AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking

📅 2026-05-20
🤖 AI Summary
This work targets the loss of accuracy in 3D hand pose tracking when hand-object interaction causes severe visual occlusion. It proposes a multimodal approach that fuses egocentric (first-person) vision with glove-mounted IMU signals. Vision and IMU streams are captured synchronously, and a cross-sensor deep attention mechanism adaptively modulates how much trust is assigned to the vision input and to each individual IMU sensor. The contributions include DexGloveHOI, a real-world dataset of more than 100,000 paired vision-IMU samples with synchronized 3D pose annotations, and a fusion framework evaluated under two hand models (UmeTrack and MANO). On the DexGloveHOI benchmark, the method reduces mean keypoint error by 16.1% and the wrist-aligned variant of that error by 24.2% relative to the baselines, improving tracking in occluded scenarios.
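To make the idea of adaptively weighting sensors concrete, here is a minimal NumPy sketch of attention-style fusion over per-sensor feature vectors. It is an illustration only, not the paper's architecture: the function name `fuse_with_trust`, the single query vector, and the optional `reliability_bias` term are all assumptions introduced here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_with_trust(tokens, query, reliability_bias=None):
    """Fuse per-sensor features with softmax attention weights.

    tokens: (S, D) array, one feature vector per sensor
            (hypothetical layout: row 0 = vision, rows 1.. = IMUs).
    query:  (D,) vector that scores each sensor token.
    reliability_bias: optional (S,) additive score offsets, e.g. a
            large negative value for a sensor believed to be unreliable.
    Returns the fused (D,) feature and the (S,) trust weights.
    """
    d = tokens.shape[-1]
    scores = tokens @ query / np.sqrt(d)   # scaled dot-product scores
    if reliability_bias is not None:
        scores = scores + reliability_bias
    weights = softmax(scores)              # non-negative, sums to 1
    fused = weights @ tokens               # weighted sum of sensor features
    return fused, weights


rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))           # 1 vision token + 5 IMU tokens (assumed)
query = rng.normal(size=8)
fused, weights = fuse_with_trust(tokens, query)

# Simulate an occluded camera by pushing the vision score down.
bias = np.zeros(6)
bias[0] = -1e9
_, occluded_weights = fuse_with_trust(tokens, query, reliability_bias=bias)
```

In this sketch, lowering the score of the vision token shifts almost all of the weight onto the IMU tokens, which is the qualitative behavior the summary describes for occluded frames. In a learned model the scores would come from trained projections rather than a fixed query.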
📝 Abstract
We present AVI-HT, an adaptive visual-IMU fusion approach for tracking 3D hand poses by jointly modeling the egocentric image with on-glove 6-DoF IMU signals. AVI-HT achieves significantly improved accuracy and availability, particularly in hand-object interaction (HOI) scenarios involving heavy visual occlusion. Two complementary ingredients underpin its success: (1) synchronized multi-modal training data pairing on-body vision-IMU sensor streams with ground-truth 3D hand poses from a motion-capture system, and (2) a cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors. To evaluate AVI-HT in real-world settings, we conduct extensive experiments on our DexGloveHOI dataset that consists of 100K+ pairwise vision-IMU samples with synchronized 3D annotated poses, in which users manipulate a variety of objects during daily tasks. We compare against multiple single- and multi-modal tracking approaches under two hand models (UmeTrack, MANO). The results show that AVI-HT reduces mean keypoint error by 16.1% and its wrist-aligned variant by 24.2% over the baselines. Ablation studies further reveal the per-finger contribution of IMU sensors across activity types, and the model's sensitivity to IMU noise and temporal misalignment in vision-IMU fusion.
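The two reported metrics can be sketched as follows. This assumes that mean keypoint error is the average Euclidean distance between predicted and ground-truth 3D keypoints, and that the wrist-aligned variant subtracts the wrist keypoint from both before measuring; the abstract does not spell out either definition. The function names and the choice of index 0 for the wrist are also assumptions made for illustration.

```python
import numpy as np


def mean_keypoint_error(pred, gt):
    """Average Euclidean distance over all frames and keypoints.

    pred, gt: (N, K, 3) arrays of 3D keypoints in the same units.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def wrist_aligned_error(pred, gt, wrist_index=0):
    """Keypoint error after translating each hand so its wrist is at the origin.

    wrist_index=0 is an assumption about the keypoint ordering.
    """
    pred_rel = pred - pred[:, wrist_index:wrist_index + 1]
    gt_rel = gt - gt[:, wrist_index:wrist_index + 1]
    return mean_keypoint_error(pred_rel, gt_rel)


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy data: every predicted keypoint is offset by (3, 4, 0) from ground truth.
gt = np.zeros((2, 21, 3))
pred = gt + np.array([3.0, 4.0, 0.0])

global_error = mean_keypoint_error(pred, gt)      # offset has length 5
aligned_error = wrist_aligned_error(pred, gt)     # pure translation cancels out
```

The toy case shows why the two numbers can differ: a pure translation of the whole hand contributes fully to the global error but vanishes after wrist alignment, so the wrist-aligned metric isolates errors in hand articulation. A statement such as "reduces error by 16.1%" corresponds to `relative_reduction` applied to a baseline error and the new method's error.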
Problem

Research questions and friction points this paper is trying to address.

3D hand tracking
hand-object interaction
visual occlusion
vision-IMU fusion
egocentric sensing
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive fusion
vision-IMU
3D hand tracking
deep attention mechanism
hand-object interaction