AI Summary
This work proposes a physics-aware conditional diffusion framework that addresses two gaps in existing hand motion reconstruction: estimates often lack physical consistency, and they carry no measure of how well the motion satisfies physics. The method formulates Euler-Lagrange dynamics for articulated hands on top of a MeshCNN-Transformer backbone and treats the resulting dynamic residuals as virtual observations in the diffusion process, rather than forcing them to zero, to refine noisy pose sequences into physically plausible hand motion. A last-layer Laplace approximation yields per-joint, per-time variance maps that indicate where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods, and qualitative results indicate that the estimated variance aligns with the physical plausibility of the motion.
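To make the notion of a dynamic residual used as a virtual observation concrete, the block below writes out one common form. It is only a sketch under stated assumptions: the summary does not give the paper's equations, so the manipulator-form dynamics, the symbols $M$, $C$, $g$, $\tau$, and the Gaussian noise scale $\sigma_r$ are all illustrative notation rather than the authors' formulation.

```latex
% Assumed notation: q_t joint configuration, M mass matrix, C Coriolis terms,
% g gravity/potential terms, tau_t generalized forces, sigma_r residual noise scale.
\begin{align}
  % Euler-Lagrange dynamics in the standard manipulator form
  M(q_t)\,\ddot{q}_t + C(q_t,\dot{q}_t)\,\dot{q}_t + g(q_t) &= \tau_t \\
  % Dynamic residual of an estimated trajectory
  r_t &= M(q_t)\,\ddot{q}_t + C(q_t,\dot{q}_t)\,\dot{q}_t + g(q_t) - \tau_t \\
  % Virtual observation: the residual is "observed" to be zero, up to Gaussian noise
  p(\hat{r}_t = 0 \mid q_{1:T}) &= \mathcal{N}\!\left(0;\; r_t,\; \sigma_r^2 I\right)
\end{align}
```

Under this reading, enforcing zero residuals corresponds to the limit $\sigma_r \to 0$, while a finite $\sigma_r$ lets the physics act as a soft likelihood term alongside the image-based evidence.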
Abstract
Significant advancements made in reconstructing hands from images have delivered accurate single-frame estimates, yet they often lack physics consistency and provide no notion of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the physics variance in motion estimates. Building on a MeshCNN-Transformer backbone, we formulate Euler-Lagrange dynamics for articulated hands. Unlike prior works that enforce zero residuals, we treat the resulting dynamic residuals as virtual observables to more effectively integrate physics. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and offers interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods. Qualitative results confirm that our variance estimations are aligned with the physical plausibility of the motion in image-based estimates.
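As an illustration of how a last-layer Laplace approximation can produce a per-joint, per-time variance map, the sketch below computes the predictive variance of a linear-Gaussian output layer with NumPy. It is not the authors' implementation: the function name, the feature shape (8 time steps, 21 joints, 16 feature dimensions), the noise variance, and the prior precision are all hypothetical, and random features stand in for the backbone's penultimate-layer activations.

```python
import numpy as np


def last_layer_laplace_variance(features, noise_var=1.0, prior_precision=1.0):
    """Predictive (epistemic) variance of a linear-Gaussian last layer.

    features: array of shape (T, J, D) holding hypothetical penultimate-layer
        features for T time steps and J joints.
    Returns an array of shape (T, J) with one variance per joint and time step.
    """
    T, J, D = features.shape
    phi = features.reshape(-1, D)  # flatten to (T*J, D)

    # Posterior precision of the last-layer weights under a Gaussian likelihood
    # and an isotropic Gaussian prior. For a linear last layer, the Laplace
    # approximation coincides with this exact Gaussian posterior.
    precision = phi.T @ phi / noise_var + prior_precision * np.eye(D)
    cov = np.linalg.inv(precision)

    # Variance phi^T cov phi for every (time, joint) feature vector.
    var = np.einsum("nd,de,ne->n", phi, cov, phi)
    return var.reshape(T, J)


# Assumed toy sizes: 8 frames, 21 hand joints, 16 feature dimensions.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 21, 16))
variance_map = last_layer_laplace_variance(feats)
print(variance_map.shape)  # (8, 21)
```

Each entry of `variance_map` can then be rendered as a heat map over joints and frames. In the paper's setting the variances are tied to the physics residuals, which this toy example does not model.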