Multi-view Hand Reconstruction with a Point-Embedded Transformer

📅 2024-08-20
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the generalization and robustness bottlenecks of multi-view hand mesh reconstruction in real-world scenarios, this paper proposes a point-embedded modeling paradigm. It represents hand geometry with a learnable, multi-view-shared static 3D point set and fuses aligned multi-view image features via a point-embedding Transformer. It further introduces cross-dataset mixed training and camera-configuration randomization, significantly enhancing generalization to unseen camera layouts and out-of-domain data. The method unifies left- and right-hand modeling, is plug-and-play, and requires neither pose priors nor template-deformation constraints. Evaluated on five large-scale multi-view datasets, it achieves accurate, robust, and computationally efficient hand reconstruction under complex real-world conditions. Code and pre-trained models are publicly available.
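As a rough illustration of the point-embedded idea summarized above, the sketch below projects a shared static 3D point set into each camera view and bilinearly samples the corresponding image features, which is the alignment step that makes multi-view fusion possible. This is a minimal sketch, not the released POEM-v2 code: the function name, tensor layouts, and shapes are assumptions for illustration.

```python
# Minimal sketch (assumed shapes, not the authors' implementation):
# project a shared static 3D point set into each camera view and
# bilinearly sample per-view image features at the projections.
import torch
import torch.nn.functional as F

def sample_point_features(points, feat_maps, K, R, t):
    """
    points:    (P, 3)            shared static basis points in world space
    feat_maps: (V, C, H, W)      per-view image feature maps
    K:         (V, 3, 3)         camera intrinsics
    R, t:      (V, 3, 3), (V, 3) world-to-camera extrinsics
    returns    (V, P, C)         per-view features gathered at each point
    """
    # Transform points into each camera frame and project with the pinhole model.
    cam_pts = torch.einsum("vij,pj->vpi", R, points) + t[:, None, :]       # (V, P, 3)
    proj = torch.einsum("vij,vpj->vpi", K, cam_pts)                        # (V, P, 3)
    uv = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)                     # (V, P, 2) pixels
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    H, W = feat_maps.shape[-2:]
    grid = torch.stack([uv[..., 0] / (W - 1), uv[..., 1] / (H - 1)], -1) * 2 - 1
    feats = F.grid_sample(feat_maps, grid.unsqueeze(2), align_corners=True)  # (V, C, P, 1)
    return feats.squeeze(-1).permute(0, 2, 1)                                # (V, P, C)
```

Because the same point set is reused for every view, it acts as a shared 3D frame of reference no matter how many cameras are available or how they are arranged.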

📝 Abstract
This work introduces a novel and generalizable multi-view Hand Mesh Reconstruction (HMR) model, named POEM, designed for practical use in real-world hand motion capture scenarios. The advances of the POEM model consist of two main aspects. First, concerning the modeling of the problem, we propose embedding a static basis point within the multi-view stereo space. A point represents a natural form of 3D information and serves as an ideal medium for fusing features across different views, given its varied projections across these views. Consequently, our method harnesses a simple yet effective idea: a complex 3D hand mesh can be represented by a set of 3D basis points that 1) are embedded in the multi-view stereo, 2) carry features from the multi-view images, and 3) encompass the hand within them. The second advance lies in the training strategy. We utilize a combination of five large-scale multi-view datasets and employ randomization in the number, order, and poses of the cameras. By processing such a vast amount of data and a diverse array of camera configurations, our model demonstrates notable generalizability in real-world applications. As a result, POEM presents a highly practical, plug-and-play solution that enables user-friendly, cost-effective multi-view motion capture for both left and right hands. The model and source codes are available at https://github.com/JubSteven/POEM-v2.
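The fusion step described in the abstract can be pictured as a Transformer in which each basis point acts as a query token that attends over the features sampled at its projections in the different views, followed by attention across points and a vertex regression head. The sketch below is a simplified reading of that idea; `PointFusionBlock`, the layer layout, and all dimensions are assumptions for illustration, not the released POEM-v2 architecture.

```python
# Hedged sketch of the point-embedding fusion stage: each basis point is a
# learnable query token that attends over its per-view sampled features.
# Module names, dimensions, and the vertex head are illustrative assumptions.
import torch
import torch.nn as nn

class PointFusionBlock(nn.Module):
    def __init__(self, dim=256, heads=8, num_points=1024, num_verts=778):
        super().__init__()
        self.point_embed = nn.Parameter(torch.randn(num_points, dim))  # learnable point tokens
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.to_verts = nn.Linear(num_points, num_verts)  # map point tokens to mesh vertices
        self.vert_head = nn.Linear(dim, 3)                # regress a 3D position per vertex

    def forward(self, view_feats):
        # view_feats: (B, V, P, C) features gathered from V views at P basis points
        B, V, P, C = view_feats.shape
        q = self.point_embed.unsqueeze(0).expand(B, -1, -1)           # (B, P, C) queries
        kv = view_feats.permute(0, 2, 1, 3).reshape(B * P, V, C)      # per-point view set
        fused, _ = self.cross_attn(q.reshape(B * P, 1, C), kv, kv)    # fuse across views
        tokens = self.self_attn(fused.reshape(B, P, C))               # exchange across points
        verts = self.to_verts(tokens.transpose(1, 2)).transpose(1, 2) # (B, num_verts, C)
        return self.vert_head(verts)                                  # (B, num_verts, 3)
```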
Problem

Research questions and friction points this paper is trying to address.

How to reconstruct an accurate 3D hand mesh from multi-view images in real-world capture setups
How to fuse image features across views whose number, order, and poses vary
How to generalize beyond the camera configurations and datasets seen during training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Point-embedded transformer for multi-view hand mesh reconstruction
Static basis points fuse features across different views
Training with randomized multi-view datasets enhances generalizability
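The camera-configuration randomization highlighted above can be approximated per training sample as drawing a random number and a random ordering of the available cameras, so the model never relies on a fixed rig layout. The snippet below is a minimal sketch under that assumption; the dataset interface and field names are hypothetical.

```python
# Illustrative sketch of camera-configuration randomization (assumed interface):
# for each training sample, draw a random subset and a random order of cameras.
import random

def randomize_camera_config(views, min_views=2, max_views=8):
    """views: list of per-camera dicts, e.g. {"image", "K", "R", "t"}."""
    n = random.randint(min_views, min(max_views, len(views)))
    chosen = random.sample(views, n)   # random count -> robust to unseen rig sizes
    random.shuffle(chosen)             # random order -> no fixed camera indexing
    return chosen
```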
Authors

Lixin Yang
School of Artificial Intelligence (SAI), Shanghai Jiao Tong University, Shanghai 200240, China
Licheng Zhong
Pengxiang Zhu
University of California, Riverside
Xinyu Zhan
Shanghai Jiao Tong University
Junxiao Kong
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Jian Xu
Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China
Cewu Lu
School of Artificial Intelligence (SAI), Shanghai Jiao Tong University, Shanghai 200240, China