BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting

📅 2025-04-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenging problem of reconstructing two 3D hands and an unknown object interacting in a monocular RGB video, where severe occlusion and category-agnostic object modeling pose significant difficulties. To this end, the authors propose BIGS, the first comprehensive pipeline for bimanual category-agnostic interaction reconstruction, built on 3D Gaussian Splatting. Methodologically: (i) a pre-trained diffusion model, distilled through a Score Distillation Sampling (SDS) loss, provides geometric guidance for optimizing Gaussian representations of unseen object parts; (ii) a single 3D Gaussian representation is shared between both hands and anchored to MANO hand priors, accumulating hand 3D information across limited views; and (iii) an interacting-subjects optimization step jointly refines hand and object Gaussians to enforce hand-object 3D alignment. The approach achieves state-of-the-art performance on two benchmark datasets, significantly outperforming prior methods in hand pose accuracy (MPJPE ↓), object reconstruction quality (Chamfer Distance ↓, F-score ↑), and novel-view synthesis fidelity (PSNR/SSIM ↑, LPIPS ↓).

📝 Abstract
Reconstructing the 3D geometry of hand-object interaction (HOI) is a fundamental problem with numerous applications. Despite recent advances, there is no comprehensive pipeline yet for bimanual class-agnostic interaction reconstruction from a monocular RGB video, where two hands and an unknown object interact with each other. Previous works tackled limited hand-object interaction cases, where an object template is pre-known or only one hand is involved in the interaction. Bimanual interaction reconstruction exhibits severe occlusions introduced by complex interactions between two hands and an object. To solve this, we first introduce BIGS (Bimanual Interaction 3D Gaussian Splatting), a method that reconstructs 3D Gaussians of hands and an unknown object from a monocular video. To robustly obtain object Gaussians despite severe occlusions, we leverage the prior knowledge of a pre-trained diffusion model with a score distillation sampling (SDS) loss to reconstruct unseen object parts. For hand Gaussians, we exploit the 3D priors of a hand model (i.e., MANO) and share a single Gaussian for both hands to effectively accumulate hand 3D information, given limited views. To further consider the 3D alignment between hands and objects, we include an interacting-subjects optimization step during Gaussian optimization. Our method achieves state-of-the-art accuracy on two challenging datasets, in terms of 3D hand pose estimation (MPJPE), 3D object reconstruction (CDh, CDo, F10), and rendering quality (PSNR, SSIM, LPIPS).
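The SDS guidance described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the noise schedule, the `toy_denoiser` (a hypothetical stand-in for a pre-trained diffusion model's noise predictor), and all numeric values are assumptions for the sake of a runnable example.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x_t, t):
    """Hypothetical stand-in for a pre-trained diffusion model's noise
    predictor; the real method would query a large pre-trained model.
    This toy prior 'believes' the clean image should be all 0.5s."""
    alpha_bar = np.exp(-t)  # toy noise schedule (assumption)
    return (x_t - np.sqrt(alpha_bar) * 0.5) / np.sqrt(1.0 - alpha_bar)

def sds_gradient(rendered, t, w=1.0):
    """One SDS gradient for a rendered view of the object Gaussians:
    grad = w(t) * (eps_hat(x_t, t) - eps), skipping the denoiser Jacobian."""
    alpha_bar = np.exp(-t)
    eps = rng.standard_normal(rendered.shape)          # sampled noise
    x_t = np.sqrt(alpha_bar) * rendered + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = toy_denoiser(x_t, t)                     # predicted noise
    return w * (eps_hat - eps)

# One SDS step nudges the toy "rendered image" toward the diffusion prior.
image = np.full((8, 8), 0.9)
image -= 0.01 * sds_gradient(image, t=0.5)
```

With this toy denoiser the sampled noise cancels analytically, so each step deterministically pulls the render toward the prior's preferred image; in the actual method the same gradient shape supervises unseen parts of the object Gaussians from novel views.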
Problem

Research questions and friction points this paper is trying to address.

Reconstruct bimanual hand-object interactions from monocular videos
Handle severe occlusions in complex two-hand unknown object interactions
Improve 3D alignment and rendering quality of interacting hands and objects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 3D Gaussian Splatting for bimanual interaction reconstruction
Leverages pre-trained diffusion model with SDS loss
Optimizes hand-object alignment during Gaussian optimization
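The interacting-subjects alignment idea behind the last bullet can be sketched with a toy gradient-descent loop. Everything here is an illustrative assumption: the real method optimizes full MANO-anchored hand Gaussians and object Gaussians, whereas this sketch collapses each subject to a single 3D point and uses a hypothetical quadratic contact loss.

```python
import numpy as np

# Toy stand-ins: one "contact point" per hand and an object centre
# (hypothetical; the paper aligns full Gaussian/MANO representations).
left_hand = np.array([-0.10, 0.00, 0.00])
right_hand = np.array([0.12, 0.00, 0.00])
obj = np.array([0.05, 0.03, 0.00])

def alignment_loss(lh, rh, o):
    """Toy interaction-alignment loss: squared distance from each hand's
    contact point to the object (collapsed to its centre for illustration)."""
    return np.sum((lh - o) ** 2) + np.sum((rh - o) ** 2)

# Joint refinement: descend the alignment loss w.r.t. every subject's pose
# (analytic gradients of the quadratic loss; per-subject photometric terms
# from the single-subject stage are omitted).
lr = 0.1
for _ in range(50):
    obj -= lr * (2 * (obj - left_hand) + 2 * (obj - right_hand))
    left_hand -= lr * 2 * (left_hand - obj)
    right_hand -= lr * 2 * (right_hand - obj)
```

Running the loop drives the alignment loss toward zero, illustrating how a joint optimization step can pull independently reconstructed hands and object into consistent 3D contact.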
👥 Authors
Jeongwan On (UNIST)
Kyeonghwan Gwak (UNIST, South Korea)
Gunyoung Kang (UNIST, South Korea)
Junuk Cha (KAIST)
Soohyun Hwang (UNIST, South Korea)
Hyein Hwang (UNIST, South Korea)
Seungryul Baek (Associate Professor, UNIST)