DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the cross-modal alignment challenge between exocentric videos and ambient sensors, tackling two key limitations of existing egocentric–wearable methods: (P1) loss of fine-grained local action details due to global encoding, and (P2) semantic ambiguity arising from reliance on modality-invariant temporal patterns. We propose DETACH, a spatio-temporal decoupled alignment framework comprising: (i) an online clustering–driven sensor–spatial feature discovery mechanism that explicitly models local motion structures; (ii) a two-stage mutual supervision strategy for spatial alignment; and (iii) an adaptive weighted spatio-temporal contrastive loss that jointly optimizes yet decouples spatial and temporal representations. Evaluated on Opportunity++ and HWU-USP, the method significantly outperforms adapted egocentric–wearable baselines in downstream action recognition, demonstrating both the effectiveness and the generalizability of non-intrusive cross-modal alignment.
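Contribution (i) lends itself to a short illustration. Below is a minimal PyTorch sketch of what online-clustering-driven sensor-spatial feature discovery could look like: per-timestep sensor embeddings are softly assigned to a learnable prototype bank refreshed by an exponential moving average. The class name, prototype count, temperature, and EMA rule are all assumptions for illustration; the paper's actual implementation is not published here.

```python
import torch
import torch.nn.functional as F

class OnlineSensorPrototypes(torch.nn.Module):
    """Hypothetical sketch: per-timestep sensor embeddings are softly
    assigned to a bank of 'spatial' prototypes maintained by an EMA
    update. Sizes, temperature, and the update rule are assumptions."""

    def __init__(self, dim=128, num_prototypes=32, momentum=0.99, temp=0.1):
        super().__init__()
        protos = F.normalize(torch.randn(num_prototypes, dim), dim=1)
        self.register_buffer("prototypes", protos)
        self.momentum = momentum
        self.temp = temp

    @torch.no_grad()
    def _update(self, feats, assign):
        # Move each prototype toward the (soft) mean of its assigned features.
        means = F.normalize(assign.t() @ feats, dim=1)  # (K, D) soft cluster means
        self.prototypes.mul_(self.momentum).add_((1 - self.momentum) * means)
        self.prototypes.copy_(F.normalize(self.prototypes, dim=1))

    def forward(self, feats):
        # feats: (N, D) per-timestep sensor embeddings (N = batch * time).
        feats = F.normalize(feats, dim=1)
        assign = F.softmax(feats @ self.prototypes.t() / self.temp, dim=1)
        if self.training:
            self._update(feats, assign)
        # Prototype-grounded features: each timestep as a mixture of prototypes.
        return assign @ self.prototypes, assign
```

Grounding sensor features in a shared prototype bank is one common way to expose local motion structure that a single global sequence embedding would average away.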

📝 Abstract
Aligning egocentric video with wearable sensors has shown promise for human action recognition, but faces practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment by encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, causing misalignment between actions that share similar temporal patterns but differ in spatio-semantic context. To resolve these problems, we propose DETACH, a framework that explicitly decomposes spatial and temporal representations. This decomposition preserves local details, while novel sensor-spatial features discovered via online clustering provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach first establishes spatial correspondence through mutual supervision, then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on the Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.
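To make the abstract's negative handling concrete, here is a hedged PyTorch sketch of a spatially weighted InfoNCE objective in the spirit of the described loss: suspected false negatives (pairs whose spatial similarity to the anchor is very high) are suppressed, while temporally confusable but spatially distinct pairs are up-weighted as hard negatives. The function signature, thresholds, and weighting rule are hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def weighted_st_contrastive(video_t, sensor_t, spatial_sim, tau=0.07,
                            fn_thresh=0.9, hard_boost=2.0):
    """Illustrative weighted InfoNCE, assuming:
      video_t, sensor_t: (B, D) temporal embeddings of paired clips,
      spatial_sim: (B, B) precomputed spatial-alignment similarity.
    Thresholds and the weighting rule are assumptions, not the paper's."""
    v = F.normalize(video_t, dim=1)
    s = F.normalize(sensor_t, dim=1)
    logits = v @ s.t() / tau                                 # (B, B) temporal sims
    logits = logits - logits.max(dim=1, keepdim=True).values.detach()  # stability
    B = logits.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=logits.device)

    # Per-negative weights: likely false negatives (spatially near-identical
    # to the true pair) are dropped; spatially dissimilar pairs that still
    # score above the positive act as hard negatives and are boosted.
    w = torch.ones_like(logits)
    w[(spatial_sim > fn_thresh) & ~eye] = 0.0
    hard = (logits > logits.diag().unsqueeze(1)) & ~eye
    w[hard & (spatial_sim <= fn_thresh)] = hard_boost
    w[eye] = 1.0                                             # positive keeps weight 1

    exp = torch.exp(logits) * w
    loss = -(logits.diag() - exp.sum(dim=1).log()).mean()
    return loss
```

Down-weighting negatives by an auxiliary similarity signal is a standard device for false-negative suppression in contrastive learning; the sketch simply uses the spatial branch as that signal.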
Problem

Research questions and friction points this paper is trying to address.

Aligns exocentric video with ambient sensors for action recognition
Resolves misalignment from similar temporal patterns with different contexts
Improves local detail capture and semantic grounding in alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposed spatio-temporal framework for alignment
Sensor-spatial features via online clustering
Two-stage mutual supervision followed by an adaptive spatial-temporal weighted contrastive loss (see the staged-training sketch after this list)
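As referenced in the last bullet, the sketch below shows one plausible way to wire the staged schedule together, reusing weighted_st_contrastive from the sketch above. The helper names (spatial_v, temp_v, stage1_epochs) and the swapped pseudo-labeling objective in Stage 1 are illustrative assumptions, not the authors' code.

```python
import torch.nn.functional as F

def train_detach(loader, spatial_v, spatial_s, temp_v, temp_s, opt,
                 num_epochs=20, stage1_epochs=10):
    """Hypothetical staged schedule: Stage 1 aligns spatial cluster
    assignments by mutual (swapped) supervision; Stage 2 optimizes the
    weighted temporal loss, with spatial similarity as fixed guidance."""
    for epoch in range(num_epochs):
        for video, sensor in loader:
            v_sp, s_sp = spatial_v(video), spatial_s(sensor)  # (B, K) cluster logits
            if epoch < stage1_epochs:
                # Stage 1: each modality pseudo-labels the other (mutual supervision).
                loss = F.cross_entropy(v_sp, s_sp.argmax(1)) + \
                       F.cross_entropy(s_sp, v_sp.argmax(1))
            else:
                # Stage 2: detached spatial similarity weights the temporal loss.
                sim = (F.normalize(v_sp, dim=1) @
                       F.normalize(s_sp, dim=1).t()).detach()
                loss = weighted_st_contrastive(temp_v(video), temp_s(sensor), sim)
            opt.zero_grad()
            loss.backward()
            opt.step()
```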