WristCompass: Kinematic Coupling as a Learnable Visual Concept for Ego-Camera Orientation

📅 2026-05-28

🤖 AI Summary

This work addresses the problem of recovering camera orientation from egocentric manipulation video under severe hand occlusion, where methods that rely on scene geometry fail. The authors propose an anatomy-informed kinematic coupling model that exploits the motion correlation between the wrists and a head-mounted camera, imposed by the arm-shoulder-head chain, to estimate camera orientation from only 4D inter-wrist features. A lightweight GRU captures the temporal dynamics of wrist motion over short time windows, removing the need for full-hand keypoints or scene geometry. Trained only on tabletop manipulation, the method transfers zero-shot to the Epic Kitchens dataset with a median geodesic error of 14.3°, approaching the performance of a 1B-parameter scene reconstruction model while using only 200K GRU parameters and without relying on scene appearance.
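For a sense of scale, a budget of about 200K parameters is roughly what a single-layer GRU with a 4D input and a hidden size of 256 would have. The summary does not state the architecture, so the hidden size, the single layer, and the helper name `gru_param_count` below are assumptions made purely for illustration; the count uses the standard three-gate GRU formulation.

```python
def gru_param_count(input_size, hidden_size, double_bias=False):
    """Count the parameters of one GRU layer (hypothetical helper).

    A GRU has three gates (reset, update, candidate). Each gate owns an
    input weight matrix (hidden x input), a recurrent weight matrix
    (hidden x hidden), and one bias vector, or two when the input and
    recurrent biases are stored separately, as some frameworks do.
    """
    weights_per_gate = hidden_size * (input_size + hidden_size)
    biases_per_gate = hidden_size * (2 if double_bias else 1)
    return 3 * (weights_per_gate + biases_per_gate)


# Assumed configuration: 4D inter-wrist input, hidden size 256 (not from the paper).
single_bias_total = gru_param_count(4, 256)
double_bias_total = gru_param_count(4, 256, double_bias=True)
print(single_bias_total, double_bias_total)
```

Under these assumptions both bias conventions land at about 200K parameters, roughly 5,000 times fewer than a 1B-parameter scene model.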

📝 Abstract

Recovering ego-camera orientation from manipulation video is a prerequisite for disentangling hand motion from camera motion, a key step in imitation learning from egocentric demonstrations. The obvious approach, inferring orientation from scene geometry, fails when hands occlude the frame: VGGT, a 1B-parameter scene reconstruction model, scores worse than a constant predictor on the TACO benchmark. We identify an alternative visual concept that is present precisely when scene geometry is absent: kinematic coupling dynamics, the structured physical relationship between wrist motion and camera orientation imposed by the arm-shoulder-head chain. We find that this concept is compact (4D inter-wrist features outperform 126D full hand keypoints), temporal (requiring a GRU over short windows rather than per-frame retrieval), and physically grounded (transferring zero-shot across datasets because it is rooted in anatomy rather than scene appearance). Trained only on tabletop manipulation, WristCompass transfers zero-shot to Epic Kitchens cooking video, achieving 14.3° median geodesic error and approaching the performance of a 1B-parameter scene model at 200K GRU parameters.
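The reported metric, median geodesic error, is the median over samples of the rotation angle separating the predicted and true orientations. The sketch below shows the standard definition for rotation matrices, using only the Python standard library. The helper names `rot_z` and `geodesic_deg` and the three toy predictions are illustrative assumptions, not the paper's evaluation code or data.

```python
import math
from statistics import median


def rot_z(angle_deg):
    """Rotation matrix for a rotation of angle_deg about the z-axis."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def geodesic_deg(r_pred, r_true):
    """Geodesic distance on SO(3), in degrees.

    theta = arccos((trace(R_pred^T R_true) - 1) / 2); the trace equals the
    element-wise (Frobenius) inner product of the two matrices.
    """
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against round-off pushing the cosine outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


# Toy predictions that are off by 10, 14.3 and 30 degrees from the identity.
identity = rot_z(0.0)
errors = [geodesic_deg(rot_z(a), identity) for a in (10.0, 14.3, 30.0)]
median_error = median(errors)
print(round(median_error, 3))
```

Reporting the median rather than the mean keeps a few badly wrong frames from dominating the score.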

Problem

Research questions and friction points this paper is trying to address.

ego-camera orientation

kinematic coupling

imitation learning

egocentric video

hand-camera disentanglement

Innovation

Methods, ideas, or system contributions that make the work stand out.

kinematic coupling

ego-camera orientation

zero-shot transfer

wrist dynamics

GRU-based temporal modeling
