Diffusion-based 3D Hand Motion Recovery with Intuitive Physics

📅 2025-08-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the low accuracy and temporal inconsistency in 3D hand motion recovery from monocular video sequences in hand–object interaction (HOI) scenarios. We propose a diffusion model–driven motion optimization framework that operates without image-level supervision. Leveraging only motion-capture data, the method learns a prior distribution over hand motions and explicitly incorporates physically grounded constraints—including hand–object contact, joint kinematic feasibility, and motion continuity—into the diffusion denoising process to model interaction dynamics. Compared to existing single-frame approaches, our framework achieves state-of-the-art performance on HOI-3D and FreiHAND+Object benchmarks, improving 3D pose accuracy by 18.7% (mean error reduction) and temporal smoothness by 32.4% (jerk metric). The method thus delivers high-fidelity, temporally coherent 3D hand reconstructions for complex HOI sequences.

Technology Category

Application Category

📝 Abstract
While 3D hand reconstruction from monocular images has made significant progress, generating accurate and temporally coherent motion estimates from videos remains challenging, particularly during hand-object interactions. In this paper, we present a novel 3D hand motion recovery framework that enhances image-based reconstructions through a diffusion-based and physics-augmented motion refinement model. Our model captures the distribution of refined motion estimates conditioned on initial ones, generating improved sequences through an iterative denoising process. Instead of relying on scarce annotated video data, we train our model only using motion capture data without images. We identify valuable intuitive physics knowledge during hand-object interactions, including key motion states and their associated motion constraints. We effectively integrate these physical insights into our diffusion model to improve its performance. Extensive experiments demonstrate that our approach significantly improves various frame-wise reconstruction methods, achieving state-of-the-art (SOTA) performance on existing benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Improving 3D hand motion accuracy in videos
Enhancing motion coherence during hand-object interactions
Training model without annotated video data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based motion refinement model
Physics-augmented hand-object interaction
Training with motion capture data
🔎 Similar Papers
No similar papers found.
Y
Yufei Zhang
Rensselaer Polytechnic Institute
Zijun Cui
Zijun Cui
Michigan State University
Knowledge-augmented Deep LearningProbabilistic Graphical ModelsComputer Vision
J
Jeffrey O. Kephart
IBM Research
Q
Qiang Ji
Rensselaer Polytechnic Institute