Multi-view Hand Reconstruction with a Point-Embedded Transformer

📅 2024-08-20
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
To improve the generalization and robustness of multi-view hand mesh reconstruction in real-world scenarios, this paper proposes POEM, a point-embedded modeling approach. It represents hand geometry as a static set of 3D basis points shared across views and fuses the aligned multi-view image features with a point-embedded Transformer. It also introduces mixed training across datasets and randomization of the camera configuration, which improves generalization to unseen camera layouts and out-of-domain data. The method handles left and right hands in a single model, is plug-and-play, and requires neither pose priors nor template deformation constraints. Trained on a combination of five large-scale multi-view datasets, it is reported to deliver accurate, robust, and computationally efficient hand reconstruction under complex real-world conditions. Code and pre-trained models are publicly available.

📝 Abstract
This work introduces a novel and generalizable multi-view Hand Mesh Reconstruction (HMR) model, named POEM, designed for practical use in real-world hand motion capture scenarios. The advances of the POEM model consist of two main aspects. First, concerning the modeling of the problem, we propose embedding static basis points within the multi-view stereo space. A point is a natural form of 3D information and serves as an ideal medium for fusing features across different views, because it projects to a different location in each view. Consequently, our method harnesses a simple yet effective idea: a complex 3D hand mesh can be represented by a set of 3D basis points that 1) are embedded in the multi-view stereo space, 2) carry features from the multi-view images, and 3) encompass the hand. The second advance lies in the training strategy. We utilize a combination of five large-scale multi-view datasets and randomize the number, order, and poses of the cameras. By processing such a vast amount of data and a diverse array of camera configurations, our model demonstrates notable generalizability in real-world applications. As a result, POEM presents a highly practical, plug-and-play solution that enables user-friendly, cost-effective multi-view motion capture for both left and right hands. The model and source code are available at https://github.com/JubSteven/POEM-v2.
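
The basis-point idea in the abstract can be illustrated with a minimal sketch: each 3D point is projected into every calibrated view, picks up the image feature at its projected pixel, and the per-view features are then combined. This is not the POEM implementation. The function names, the pinhole camera convention (`K`, `R`, `t`), the nearest-neighbour sampling, and the mean over views are all assumptions made for illustration; in particular, the mean stands in for the Transformer-based fusion that the paper describes.

```python
import numpy as np

def project_points(points, K, R, t):
    """Project (N, 3) world-space points into one pinhole camera.

    K is the 3x3 intrinsic matrix; R (3x3) and t (3,) map world to camera.
    Returns (N, 2) pixel coordinates (u = column, v = row).
    """
    cam = points @ R.T + t          # world frame -> camera frame
    uv = cam @ K.T                  # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]   # perspective divide

def sample_features(feat, uv):
    """Nearest-neighbour lookup of an (H, W, C) feature map at (N, 2) pixels."""
    h, w, _ = feat.shape
    cols = np.clip(np.rint(uv[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.rint(uv[:, 1]).astype(int), 0, h - 1)
    return feat[rows, cols]         # (N, C)

def fuse_point_features(points, feats, Ks, Rs, ts):
    """Gather per-view features for each point and average them over views.

    Assumption: a plain mean replaces the learned fusion, purely to show
    how one 3D point links features from several images.
    """
    per_view = [
        sample_features(f, project_points(points, K, R, t))
        for f, K, R, t in zip(feats, Ks, Rs, ts)
    ]
    return np.mean(per_view, axis=0)  # (N, C)
```

Because the same 3D point lands on a different pixel in each camera, the fused feature depends on the whole camera rig rather than on any single image, which is why a point is a convenient carrier for cross-view information.
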

Problem

Research questions and friction points this paper is trying to address.

Reconstructing a 3D hand mesh from multi-view images
Fusing image features across views with differing projections
Generalizing to unseen camera configurations and real-world data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Point-embedded transformer for multi-view hand mesh reconstruction
Static basis points fuse features across different views
Training with randomized multi-view datasets enhances generalizability
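
The camera randomization mentioned above can be sketched as follows. This is a hypothetical helper, not code from the POEM repository: the name `randomize_cameras` and the `min_views` parameter are assumptions. It covers only the number and order of cameras; the abstract also mentions randomizing camera poses, which is omitted here because the source does not describe how that is done.

```python
import random

def randomize_cameras(views, min_views=2, rng=None):
    """Return a random-size subset of the views in random order.

    `views` is any list of per-camera records (image, intrinsics, extrinsics).
    Assumption: at least `min_views` cameras are kept for every sample.
    """
    if not 1 <= min_views <= len(views):
        raise ValueError("min_views must be between 1 and len(views)")
    rng = rng or random.Random()
    k = rng.randint(min_views, len(views))  # random number of cameras
    return rng.sample(views, k)             # k distinct views, shuffled order
```

Applying a helper like this to every training sample exposes the model to many rig sizes and orderings, so it cannot rely on a fixed camera count or on a particular camera always appearing first.
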