🤖 AI Summary
Existing physics-based monocular human pose tracking methods produce artifacts when the ground is not flat or the camera moves, and they are often evaluated either on in-the-wild videos without ground truth or on synthetic data that does not capture real-world light transport, camera motion, and pose-induced appearance and geometry changes. To address these limitations, we introduce MoviCam, the first non-synthetic dynamic-camera dataset with ground-truth camera trajectories, scene geometry, and 3D human motion with human–scene contact labels. We further propose PhysDynPose, which combines a kinematic pose estimator, a robust SLAM method, and a scene-aware physics optimizer to place monocular pose estimates in the world coordinate frame while enforcing physical constraints. On the new benchmark, even state-of-the-art methods struggle with moving cameras and non-planar environments, whereas PhysDynPose robustly estimates both human and camera poses in world coordinates.
📝 Abstract
Most monocular and physics-based human pose tracking methods, while achieving state-of-the-art results, suffer from artifacts when the scene does not have a strictly flat ground plane or when the camera is moving. Moreover, these methods are often evaluated on in-the-wild real-world videos without ground-truth data or on synthetic datasets, which fail to model real-world light transport, camera motion, and pose-induced appearance and geometry changes. To tackle these two problems, we introduce MoviCam, the first non-synthetic dataset containing ground-truth camera trajectories of a dynamically moving monocular RGB camera, scene geometry, and 3D human motion with human-scene contact labels. Additionally, we propose PhysDynPose, a physics-based method that incorporates scene geometry and physical constraints for more accurate human motion tracking under camera motion and in non-flat scenes. More precisely, we use a state-of-the-art kinematics estimator to obtain the human pose and a robust SLAM method to capture the dynamic camera trajectory, enabling the recovery of the human pose in the world frame. We then refine the kinematic pose estimate using our scene-aware physics optimizer. On our new benchmark, we find that even state-of-the-art methods struggle with this inherently challenging setting, i.e., a moving camera and non-planar environments, while our method robustly estimates both human and camera poses in world coordinates.
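The step of recovering the human pose in the world frame from a camera-frame pose and a SLAM camera trajectory can be illustrated with a rigid transform. The sketch below is not the PhysDynPose implementation. It assumes that SLAM provides, per frame, a camera-to-world rotation `R_wc` and translation `t_wc`, that the kinematics estimator returns 3D joints in camera coordinates, and that the scale ambiguity of monocular SLAM has already been resolved. The function name `camera_to_world` and all values are hypothetical, and the scene-aware physics refinement is not shown.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def camera_to_world(joints_cam: Sequence[Vec3], R_wc: Mat3, t_wc: Vec3) -> List[Vec3]:
    """Map camera-frame joints to the world frame: X_world = R_wc @ X_cam + t_wc.

    R_wc and t_wc are the (assumed) camera-to-world rotation and translation
    for one video frame, e.g. taken from a SLAM trajectory.
    """
    joints_world = []
    for x in joints_cam:
        joints_world.append(tuple(
            sum(R_wc[r][c] * x[c] for c in range(3)) + t_wc[r]
            for r in range(3)
        ))
    return joints_world


# Hypothetical frame: camera rotated 90 degrees about the world z-axis
# and positioned at (1, 0, 0) in world coordinates.
R_wc = [[0.0, -1.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0]]
t_wc = (1.0, 0.0, 0.0)

# Two hypothetical joints, 2 m in front of the camera.
joints_cam = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
joints_world = camera_to_world(joints_cam, R_wc, t_wc)
```

Applying this transform per frame yields a world-frame kinematic trajectory. In the pipeline described above, such a trajectory would then serve as the input to the physics-based refinement.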