🤖 AI Summary
This work addresses the recovery of camera orientation from egocentric manipulation video when hands occlude much of the frame, a setting in which scene-geometry methods often fail. The authors propose WristCompass, an anatomy-informed model of kinematic coupling: the arm-shoulder-head chain ties wrist motion to the orientation of the head-mounted camera, so orientation can be estimated from 4D inter-wrist features alone. A lightweight GRU captures the temporal dynamics of wrist motion over short windows, with no need for full-hand keypoints or scene geometry. Trained only on tabletop manipulation, the model transfers zero-shot to Epic Kitchens with a median geodesic error of 14.3°, approaching the performance of a 1B-parameter scene reconstruction model with only 200K GRU parameters.
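To give a sense of scale for the 200K figure, here is a back-of-envelope parameter count for a single-layer GRU. This is only a sketch: the source does not state the hidden size or layer count, so the hidden size of 256 below is a hypothetical choice, and the count assumes the common convention of two bias vectors per gate (as in PyTorch's `torch.nn.GRU`). The helper name `gru_param_count` is illustrative, not from the paper.

```python
def gru_param_count(input_size: int, hidden_size: int, num_layers: int = 1) -> int:
    """Count trainable parameters of a stacked GRU.

    Each layer has 3 gates (reset, update, candidate). Per gate:
      - input-to-hidden weights:  hidden_size * in_dim
      - hidden-to-hidden weights: hidden_size * hidden_size
      - two bias vectors:         2 * hidden_size  (PyTorch-style convention)
    """
    total = 0
    in_dim = input_size
    for _ in range(num_layers):
        per_gate = hidden_size * in_dim + hidden_size * hidden_size + 2 * hidden_size
        total += 3 * per_gate
        in_dim = hidden_size  # deeper layers consume the previous layer's output
    return total


# Hypothetical configuration: 4D inter-wrist input, hidden size 256 (assumed).
params_4d = gru_param_count(input_size=4, hidden_size=256)

# Same assumed GRU fed 126D full hand keypoints, for comparison.
params_126d = gru_param_count(input_size=126, hidden_size=256)
```

Under these assumptions the 4D-input GRU has roughly 200K parameters, and most of them sit in the hidden-to-hidden weights, so the input dimensionality contributes comparatively little to model size.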
📝 Abstract
Recovering ego-camera orientation from manipulation video is a prerequisite for disentangling hand motion from camera motion, a key step in imitation learning from egocentric demonstrations. The obvious approach, inferring orientation from scene geometry, fails when hands occlude the frame: VGGT, a 1B-parameter scene reconstruction model, scores worse than a constant predictor on the TACO benchmark. We identify an alternative visual concept that is present precisely when scene geometry is absent: kinematic coupling dynamics, the structured physical relationship between wrist motion and camera orientation imposed by the arm-shoulder-head chain. We find that this concept is compact (4D inter-wrist features outperform 126D full hand keypoints), temporal (requiring a GRU over short windows rather than per-frame retrieval), and physically grounded (transferring zero-shot across datasets because it is rooted in anatomy rather than scene appearance). Trained only on tabletop manipulation, our model, WristCompass, transfers zero-shot to Epic Kitchens cooking video, achieving 14.3$^\circ$ median geodesic error and approaching the performance of a 1B-parameter scene model with only 200K GRU parameters.
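The reported metric, median geodesic error, can be illustrated with a short sketch. It uses the standard definition of geodesic distance on SO(3), the angle of the relative rotation between prediction and ground truth, $\theta = \arccos\big((\operatorname{tr}(R_{\text{pred}}^\top R_{\text{true}}) - 1)/2\big)$. The abstract does not spell out its evaluation protocol, so the function names and the toy rotations below are assumptions for illustration only.

```python
import numpy as np


def geodesic_error_deg(R_pred: np.ndarray, R_true: np.ndarray) -> float:
    """Angle in degrees of the relative rotation between two 3x3 rotation matrices."""
    R_rel = R_pred.T @ R_true
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    # Clip to guard against round-off pushing the value outside [-1, 1].
    cos_theta = np.clip(cos_theta, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))


def median_geodesic_error_deg(preds, trues) -> float:
    """Median of per-frame geodesic errors over paired rotation sequences."""
    errors = [geodesic_error_deg(p, t) for p, t in zip(preds, trues)]
    return float(np.median(errors))


def rot_z(deg: float) -> np.ndarray:
    """Helper for the toy example: rotation about the z-axis by `deg` degrees."""
    t = np.radians(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Toy data (not from the paper): predictions are off by 10, 20 and 40 degrees.
trues = [rot_z(0.0), rot_z(0.0), rot_z(0.0)]
preds = [rot_z(10.0), rot_z(20.0), rot_z(40.0)]
median_err = median_geodesic_error_deg(preds, trues)
```

The median is less sensitive than the mean to a few badly wrong frames, which is one plausible reason to report it for orientation estimates on occluded video.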