WristCompass: Kinematic Coupling as a Learnable Visual Concept for Ego-Camera Orientation

📅 2026-05-28
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses recovering ego-camera orientation from egocentric manipulation video under severe hand occlusion, where methods that rely on scene geometry fail. The authors propose WristCompass, which exploits kinematic coupling: the structured physical relationship between wrist motion and camera orientation imposed by the arm-shoulder-head chain. The model estimates camera orientation from only 4D inter-wrist features, which outperform 126D full-hand keypoints, and uses a lightweight GRU over short time windows to capture temporal dynamics without relying on scene geometry. Trained only on tabletop manipulation, the method transfers zero-shot to the Epic Kitchens dataset with a median geodesic error of 14.3°, approaching the performance of a 1B-parameter scene reconstruction model while using about 200K GRU parameters.
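For readers unfamiliar with the metric, the geodesic error between two orientations is the angle of the relative rotation between them. The sketch below assumes orientations are given as 3x3 rotation matrices stored as nested lists; the function names and the `rot_z` helper are illustrative and do not come from the paper, which does not specify its evaluation code.

```python
import math
from statistics import median


def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (demo helper only)."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def geodesic_error_deg(r_pred, r_true):
    """Angle, in degrees, of the relative rotation r_pred^T r_true."""
    # trace(A^T B) equals the sum of elementwise products of A and B.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    cos_theta = (trace - 1.0) / 2.0
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding drift
    return math.degrees(math.acos(cos_theta))


def median_geodesic_error_deg(preds, truths):
    """Median of per-frame geodesic errors over paired rotations."""
    return median(geodesic_error_deg(p, t) for p, t in zip(preds, truths))


if __name__ == "__main__":
    truths = [rot_z(0), rot_z(0), rot_z(0)]
    preds = [rot_z(10), rot_z(20), rot_z(40)]
    print(round(median_geodesic_error_deg(preds, truths), 3))
```

The median is less sensitive than the mean to a few frames with very large errors, which is one plausible reason to report it for occlusion-heavy video.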
📝 Abstract
Recovering ego-camera orientation from manipulation video is a prerequisite for disentangling hand motion from camera motion, a key step in imitation learning from egocentric demonstrations. The obvious approach, inferring orientation from scene geometry, fails when hands occlude the frame: VGGT, a 1B-parameter scene reconstruction model, scores worse than a constant predictor on the TACO benchmark. We identify an alternative visual concept that is present precisely when scene geometry is absent: kinematic coupling dynamics, the structured physical relationship between wrist motion and camera orientation imposed by the arm-shoulder-head chain. We find that this concept is compact (4D inter-wrist features outperform 126D full hand keypoints), temporal (requiring a GRU over short windows rather than per-frame retrieval), and physically grounded (transferring zero-shot across datasets because it is rooted in anatomy rather than scene appearance). Trained only on tabletop manipulation, WristCompass transfers zero-shot to Epic Kitchens cooking video, achieving 14.3° median geodesic error and approaching the performance of a 1B-parameter scene model at 200K GRU parameters.
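As a rough sanity check on the "200K GRU parameters" figure, the sketch below counts the parameters of a unidirectional GRU using the layout of `torch.nn.GRU` (three gates, with separate input and hidden bias vectors). The hidden size of 256 and the single layer are assumptions chosen for illustration; the abstract states only the 4D input and the approximate parameter count, and any output head is ignored here.

```python
def gru_param_count(input_size, hidden_size, num_layers=1, bias=True):
    """Parameter count of a unidirectional GRU in the torch.nn.GRU layout."""
    total = 0
    for layer in range(num_layers):
        # The first layer reads the input features; later layers read hidden states.
        in_dim = input_size if layer == 0 else hidden_size
        # Three gates (reset, update, candidate), each with input and recurrent weights.
        weights = 3 * hidden_size * (in_dim + hidden_size)
        # Two bias vectors per gate: one for the input path, one for the hidden path.
        biases = 2 * 3 * hidden_size if bias else 0
        total += weights + biases
    return total


if __name__ == "__main__":
    # Assumed configuration: 4D inter-wrist input, hidden size 256, one layer.
    print(gru_param_count(4, 256))    # 201216
    # Same assumed hidden size with 126D full hand keypoints as input.
    print(gru_param_count(126, 256))  # 294912
```

Under these assumptions the recurrent weights dominate the count, so the input dimensionality has only a modest effect on model size.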
Problem

Research questions and friction points this paper is trying to address.

ego-camera orientation
kinematic coupling
imitation learning
egocentric video
hand-camera disentanglement
Innovation

Methods, ideas, or system contributions that make the work stand out.

kinematic coupling
ego-camera orientation
zero-shot transfer
wrist dynamics
GRU-based temporal modeling