🤖 AI Summary
Monocular 3D hand mesh reconstruction suffers from severe self-occlusion, strong 2D–3D mapping ambiguity, and high joint degrees of freedom, leading to low vertex localization accuracy and poor inference efficiency. To address these challenges, this paper proposes an end-to-end vertex-level regression framework. Its key contributions are: (1) a novel Dynamic Spiral Convolution (DSC) layer that jointly adapts spatial and channel-wise features according to hand topology; (2) an anatomy-aware Region-of-Interest (ROI) attention mechanism to enhance representation learning for critical joints and occluded regions; and (3) a lightweight, 2D-guided 3D vertex regression architecture. Evaluated on the FreiHAND benchmark, our method outperforms existing real-time approaches, achieving a 12.6% reduction in vertex error and an inference speed of 38 FPS—demonstrating both state-of-the-art accuracy and efficiency.
📝 Abstract
Monocular 3D hand mesh recovery is challenging due to high degrees of freedom of hands, 2D-to-3D ambiguity and self-occlusion. Most existing methods are either inefficient or less straightforward for predicting the position of 3D mesh vertices. Thus, we propose a new pipeline called Monocular 3D Hand Mesh Recovery (M3DHMR) to directly estimate the positions of hand mesh vertices. M3DHMR provides 2D cues for 3D tasks from a single image and uses a new spiral decoder consist of several Dynamic Spiral Convolution (DSC) Layers and a Region of Interest (ROI) Layer. On the one hand, DSC Layers adaptively adjust the weights based on the vertex positions and extract the vertex features in both spatial and channel dimensions. On the other hand, ROI Layer utilizes the physical information and refines mesh vertices in each predefined hand region separately. Extensive experiments on popular dataset FreiHAND demonstrate that M3DHMR significantly outperforms state-of-the-art real-time methods.