Hand Trajectory Fusion for Egocentric Natural Language Query Grounding

📅 2026-06-01

🤖 AI Summary
This work addresses a gap in natural language query-based moment localization for egocentric videos: existing methods overlook hand motion even though many queries concern hand–object interaction. The authors propose an approach that explicitly models hand trajectories, using a hand skeleton sequence encoder to extract semantically rich motion features. An adaptive gated cross-attention mechanism then integrates these hand motion cues with pretrained video–text representations. On the Ego4D NLQ v2 validation set, the clearest gains are a 2.54-point improvement in R1@IoU=0.3 for hand–object interaction queries and a 4.32-point improvement for quantity/state queries, suggesting that hand motion carries grounding cues beyond appearance alone.
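To make the fusion step concrete, here is a minimal sketch of gated cross-attention in plain Python. It is an illustration under assumptions, not the authors' implementation: video features act as queries, hand features act as keys and values, learned projection matrices are omitted, and the gate is a single sigmoid unit over the concatenated video and attended hand features. The function name and the parameters `gate_w` and `gate_b` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gated_cross_attention(video, hand, gate_w, gate_b):
    """Fuse hand features into video features (illustrative sketch).

    video:  list of T vectors of size d (queries)
    hand:   list of S vectors of size d (keys and values)
    gate_w: hypothetical gate weights, length 2 * d
    gate_b: hypothetical gate bias (scalar)
    """
    d = len(video[0])
    fused = []
    for v in video:
        # Scaled dot-product attention of one video step over all hand steps.
        attn = softmax([dot(v, h) / math.sqrt(d) for h in hand])
        context = [sum(a * h[k] for a, h in zip(attn, hand)) for k in range(d)]
        # Adaptive gate in (0, 1): how much hand context to admit at this step.
        gate = 1.0 / (1.0 + math.exp(-(dot(gate_w, v + context) + gate_b)))
        # Residual fusion: a closed gate leaves the video feature unchanged.
        fused.append([vk + gate * ck for vk, ck in zip(v, context)])
    return fused
```

The residual form means that when the gate closes, the output falls back to the pretrained video–text feature, which is one plausible reason to gate the hand stream rather than simply add it.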
📝 Abstract
Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, even though roughly 41% of Ego4D NLQ queries are answered at a moment of hand–object manipulation or its immediate outcome. We propose a hand-trajectory encoder that converts a sequence of hand skeletons into highly semantic hand kinematic features, which are then aligned and combined with pretrained video–text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.
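For readers unfamiliar with the reported metric, the sketch below shows how R1@IoU=0.3 is commonly computed: the top-ranked predicted interval for each query counts as a hit when its temporal IoU with the ground-truth interval reaches the threshold. The function names are hypothetical, and the `>=` comparison is an assumed convention; the paper's exact evaluation code may differ.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.3):
    """Percentage of queries whose top-1 prediction meets the IoU threshold."""
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)
```

Under this definition, a gain of +2.54 means that about 2.5 more queries per hundred in that category have a top-1 prediction overlapping the ground truth at IoU 0.3 or higher.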
Problem

Research questions and friction points this paper is trying to address.

Egocentric NLQ grounding
hand trajectory
hand-object interaction
temporal localization
first-person video
Innovation

Methods, ideas, or system contributions that make the work stand out.

hand trajectory
egocentric vision
natural language grounding
cross-attention fusion
hand-object interaction
Enmin Zhong
Grupo de Tratamiento de Imágenes (GTI), Information Processing and Telecommunications Center, ETSI Telecomunicación, Universidad Politécnica de Madrid, Spain
Carlos R. del-Blanco
Grupo de Tratamiento de Imágenes (GTI), Information Processing and Telecommunications Center, ETSI Telecomunicación, Universidad Politécnica de Madrid, Spain
Fernando Jaureguizar
Universidad Politécnica de Madrid
Narciso García
Grupo de Tratamiento de Imágenes (GTI), Information Processing and Telecommunications Center, ETSI Telecomunicación, Universidad Politécnica de Madrid, Spain