DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
By: Junho Yoon, Jaemo Jung, Hyunju Kim, and more
Aligning egocentric video with wearable sensors has shown promise for human action recognition, but it faces practical limitations: user discomfort, privacy concerns, and poor scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. Prior egocentric-wearable works predominantly adopt Global Alignment, encoding entire sequences into unified representations, but this approach fails in exocentric-ambient settings due to two problems: (P1) an inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, which misaligns actions that share similar temporal patterns but differ in spatio-semantic context. To resolve these problems, we propose DETACH, a framework that explicitly decomposes spatio-temporal representations. This decomposition preserves local details, while our novel sensor-spatial features, discovered via online clustering, provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach first establishes spatial correspondence through mutual supervision, then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on the Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.
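The abstract does not spell out the weighted contrastive loss, so the sketch below shows one plausible InfoNCE-style reading in PyTorch: each video-sensor negative pair is reweighted by a precomputed spatial similarity, so spatially similar mismatches are up-weighted as hard negatives, while near-duplicates are masked out as likely false negatives. All names and parameters here (`weighted_contrastive_loss`, `spatial_sim`, `hard_gamma`, `fn_thresh`) are hypothetical illustrations, not taken from the paper.

```python
# Hypothetical sketch of a spatial-temporal weighted contrastive loss.
# Assumes an InfoNCE-style objective whose negative terms are reweighted by
# spatial agreement; the paper's actual formulation may differ.
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(video_emb, sensor_emb, spatial_sim,
                              tau=0.07, hard_gamma=2.0, fn_thresh=0.9):
    """video_emb, sensor_emb: (B, D) paired embeddings (row i matches row i).
    spatial_sim: (B, B) spatial-feature similarity in [0, 1], precomputed
    from the (hypothetical) stage-one spatial correspondence."""
    v = F.normalize(video_emb, dim=-1)
    s = F.normalize(sensor_emb, dim=-1)
    logits = v @ s.t() / tau                      # (B, B) cross-modal scores
    B = logits.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=logits.device)

    # Negative weights: up-weight hard negatives (spatially similar, easily
    # confused), zero out suspected false negatives above a threshold.
    weights = (1.0 + hard_gamma * spatial_sim).masked_fill(
        spatial_sim > fn_thresh, 0.0)
    weights = weights.masked_fill(eye, 1.0)       # positives keep weight 1

    # Weighted InfoNCE: scale each exponentiated score by its pair weight.
    exp_logits = weights * logits.exp()
    log_prob = logits.diagonal() - exp_logits.sum(dim=1).log()
    return -log_prob.mean()

# Toy usage with random embeddings and a random spatial-similarity matrix.
v, s = torch.randn(8, 128), torch.randn(8, 128)
sim = torch.rand(8, 8)
loss = weighted_contrastive_loss(v, s, sim)
```

Down-weighting or masking suspected false negatives in this way follows a common debiased-contrastive-learning pattern; the thresholding rule here is only one simple way to realize the easy/hard/false-negative handling the abstract describes.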
Similar Papers
EgoX: Egocentric Video Generation from a Single Exocentric Video
CV and Pattern Recognition
Turns normal videos into your own first-person view.
Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic Alignment
CV and Pattern Recognition
Finds unusual events in videos better.
Fine-grained Spatiotemporal Grounding on Egocentric Videos
CV and Pattern Recognition
Helps robots see and understand what they are looking at.