Learning by Aligning 2D Skeleton Sequences and Multi-modality Fusion

📅 2023-05-31
🏛️ European Conference on Computer Vision
📈 Citations: 5
Influential: 0
🤖 AI Summary
This work addresses the temporal alignment challenge of 2D skeleton sequences in fine-grained human activity understanding. Methodologically: (1) it replaces conventional 3D joint coordinates with 2D skeleton heatmap sequences as input; (2) it introduces a spatiotemporal joint self-attention video Transformer to extract discriminative features; and (3) it proposes a novel heatmap augmentation strategy alongside an RGB-skeleton multimodal feature fusion mechanism. Key contributions include: the first application of 2D heatmaps for self-supervised video temporal alignment; the introduction of a dedicated heatmap spatial augmentation paradigm; and the first systematic investigation of complementary modeling between 2D skeletons and RGB modalities for alignment tasks. Extensive experiments demonstrate state-of-the-art performance on Penn Action, IKEA ASM, and H2O datasets—outperforming CASA and other SOTA methods in alignment accuracy, robustness to missing keypoints, and tolerance to input noise.
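The summary notes that the method takes 2D skeleton heatmap sequences, rather than raw 3D joint coordinates, as input. A minimal sketch (not the paper's code) of how 2D keypoints are commonly rendered as per-joint Gaussian heatmap channels; the function name, resolution, and sigma below are illustrative assumptions:

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height=64, width=64, sigma=2.0):
    """Render each 2D keypoint as one Gaussian heatmap channel.

    keypoints: array of shape (J, 2) holding (x, y) pixel coordinates.
    Returns an array of shape (J, height, width) with a peak of 1.0
    at each keypoint location.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((len(keypoints), height, width))
    for j, (x, y) in enumerate(keypoints):
        heatmaps[j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps
```

Stacking these per-frame heatmaps over time yields the video-like tensor that a spatiotemporal transformer can consume directly.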
📝 Abstract
This paper presents a self-supervised temporal video alignment framework which is useful for several fine-grained human activity understanding applications. In contrast with the state-of-the-art method of CASA, where sequences of 3D skeleton coordinates are taken directly as input, our key idea is to use sequences of 2D skeleton heatmaps as input. Unlike CASA which performs self-attention in the temporal domain only, we feed 2D skeleton heatmaps to a video transformer which performs self-attention both in the spatial and temporal domains for extracting effective spatiotemporal and contextual features. In addition, we introduce simple heatmap augmentation techniques based on 2D skeletons for self-supervised learning. Despite the lack of 3D information, our approach achieves not only higher accuracy but also better robustness against missing and noisy keypoints than CASA. Furthermore, extensive evaluations on three public datasets, i.e., Penn Action, IKEA ASM, and H2O, demonstrate that our approach outperforms previous methods in different fine-grained human activity understanding tasks. Finally, fusing 2D skeleton heatmaps with RGB videos yields the state-of-the-art on all metrics and datasets. To our best knowledge, our work is the first to utilize 2D skeleton heatmap inputs and the first to explore multi-modality fusion for temporal video alignment.
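The abstract mentions "simple heatmap augmentation techniques" for self-supervised learning, and robustness to missing keypoints. A hedged sketch of what such an augmentation could look like — one flip decision shared across the whole sequence to preserve temporal consistency, plus per-joint channel dropout to simulate missing keypoints; the exact augmentations and probabilities here are assumptions, not the paper's recipe:

```python
import numpy as np

def augment_heatmap_sequence(seq, rng, p_flip=0.5, p_drop=0.1):
    """seq: heatmap sequence of shape (T, J, H, W).

    Draws one horizontal-flip decision for the entire sequence (so all
    frames stay temporally consistent), then zeroes out randomly chosen
    joint channels to mimic missing keypoints.
    """
    seq = seq.copy()
    if rng.random() < p_flip:
        seq = seq[..., ::-1]                      # mirror the width axis
    drop = rng.random(seq.shape[1]) < p_drop      # per-joint dropout mask
    seq[:, drop] = 0.0
    return seq
```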
Problem

Research questions and friction points this paper is trying to address.

Self-supervised alignment of 2D skeleton sequences for activity understanding
Enhancing robustness against missing and noisy keypoints in skeleton data
Fusing 2D skeleton heatmaps with RGB videos for improved performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 2D skeleton heatmaps as input for alignment
Applies spatiotemporal self-attention via video transformer
Fuses 2D skeleton heatmaps with RGB videos
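The key architectural difference from CASA highlighted above is attention over both space and time rather than time alone. A toy single-head sketch of the idea — flattening frame and patch axes into one token axis so every token attends across both domains; the shapes and single-head formulation are simplifying assumptions, not the paper's architecture:

```python
import numpy as np

def spatiotemporal_self_attention(x, wq, wk, wv):
    """x: tokens of shape (T, P, D) for T frames and P spatial patches.

    Flattening time and space into one token axis of length T*P lets
    each patch attend to all patches in all frames, unlike temporal-only
    self-attention over per-frame feature vectors.
    """
    t, p, d = x.shape
    tokens = x.reshape(t * p, d)
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)     # softmax over all T*P tokens
    return (attn @ v).reshape(t, p, -1)
```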
Quoc-Huy Tran
Retrocausal, Inc.
Video Understanding · Action Recognition · 3D Perception · Autonomous Driving
Muhammad Ahmed
Retrocausal, Inc., Redmond, WA
Murad Popattia
Retrocausal, Inc., Redmond, WA
M. Hassan Ahmed
Retrocausal, Inc., Redmond, WA
Andrey Konin
Retrocausal, Inc., Redmond, WA
M. Zeeshan Zia
Retrocausal, Inc., Redmond, WA