AI Summary
This work proposes a diffusion-based framework that learns to generate high-quality 3D human motion from monocular video without ground-truth 3D data. The method learns a 3D motion prior from 2D pose sequences alone, using a pretrained 2D-to-3D lifter to supply noisy 3D teacher signals. Denoising takes place directly in 3D space and is supervised by a depth-weighted 2D reprojection loss. Because the model is trained in 3D rather than lifted to 3D only at inference, it learns a coherent 3D motion manifold and comes close to fully 3D-supervised methods while using only 2D supervision. Theoretically, the proposed loss is shown to be equivalent in expectation to direct 3D supervision under mild assumptions. Motion plausibility is further improved by adapting two regularizers, velocity consistency and over-parameterized representation alignment, to the 2D setting. On HumanML3D, the model achieves an FID of 0.88, approaching the fully supervised MDM's 0.54, and on the real-world video datasets Fit3D and NBA it shows strong quantitative metrics and is preferred by human raters.
Abstract
We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); on the real video datasets Fit3D and NBA, the method learns to generate motions consistently preferred by humans, with strong quantitative results.
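To make the idea of a depth-weighted 2D reprojection loss more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact form of the weighting, so the sketch assumes a pinhole camera with a single focal length and weights each joint's 2D error by its predicted depth divided by the focal length. The function names (`project`, `depth_weighted_reprojection_loss`, `velocity_loss_2d`), the `(T, J, 3)` array layout, and the 2D velocity term are all illustrative assumptions.

```python
import numpy as np

def project(joints_3d, focal=1.0):
    """Pinhole projection of camera-space joints (T, J, 3) to 2D (T, J, 2).

    Assumes every joint lies in front of the camera (z > 0).
    """
    z = joints_3d[..., 2:3]
    return focal * joints_3d[..., :2] / z

def depth_weighted_reprojection_loss(pred_3d, target_2d, focal=1.0):
    """Squared 2D reprojection error, scaled per joint by predicted depth / focal.

    Assumed weighting: under perspective projection a lateral 3D offset d at
    depth z shows up as roughly focal * d / z in the image, so multiplying the
    2D error by z / focal brings it back to 3D units.
    """
    z = pred_3d[..., 2]                              # (T, J)
    err_2d = project(pred_3d, focal) - target_2d     # (T, J, 2)
    weighted = (z / focal)[..., None] * err_2d
    return float(np.mean(np.sum(weighted ** 2, axis=-1)))

def velocity_loss_2d(pred_3d, target_2d, focal=1.0):
    """Hypothetical velocity-consistency term: match frame-to-frame 2D motion."""
    pred_vel = np.diff(project(pred_3d, focal), axis=0)
    target_vel = np.diff(target_2d, axis=0)
    return float(np.mean(np.sum((pred_vel - target_vel) ** 2, axis=-1)))

# Toy usage: 5 frames, 4 joints, placed 3 to 5 units in front of the camera.
rng = np.random.default_rng(0)
gt_3d = rng.uniform(-1.0, 1.0, size=(5, 4, 3))
gt_3d[..., 2] += 4.0
keypoints_2d = project(gt_3d)                        # stand-in for detected 2D poses
shifted = gt_3d + np.array([0.1, 0.0, 0.0])          # lateral 3D error of 0.1
print(depth_weighted_reprojection_loss(shifted, keypoints_2d))
```

In this toy case every joint is shifted sideways by 0.1, and the weighted loss comes out at about 0.01 (the squared 3D offset) regardless of how far each joint is from the camera. That is the intuition behind weighting by depth; the sketch covers only lateral errors and does not reproduce the paper's equivalence-in-expectation argument.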