Learning by Aligning 2D Skeleton Sequences and Multi-modality Fusion

📅 2023-05-31
🏛️ European Conference on Computer Vision
📈 Citations: 5
Influential: 0
🤖 AI Summary
This work addresses the temporal alignment challenge of 2D skeleton sequences in fine-grained human activity understanding. Methodologically: (1) it replaces conventional 3D joint coordinates with 2D skeleton heatmap sequences as input; (2) it introduces a spatiotemporal joint self-attention video Transformer to extract discriminative features; and (3) it proposes a novel heatmap augmentation strategy alongside an RGB-skeleton multimodal feature fusion mechanism. Key contributions include: the first application of 2D heatmaps for self-supervised video temporal alignment; the introduction of a dedicated heatmap spatial augmentation paradigm; and the first systematic investigation of complementary modeling between 2D skeletons and RGB modalities for alignment tasks. Extensive experiments demonstrate state-of-the-art performance on Penn Action, IKEA ASM, and H2O datasets—outperforming CASA and other SOTA methods in alignment accuracy, robustness to missing keypoints, and tolerance to input noise.
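The summary notes that the method takes 2D skeleton heatmap sequences, rather than raw 3D joint coordinates, as input. A minimal sketch (not the paper's code) of how 2D keypoints are commonly rendered as per-joint Gaussian heatmap channels; the function name, resolution, and sigma below are illustrative assumptions:

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height=64, width=64, sigma=2.0):
    """Render each 2D keypoint as one Gaussian heatmap channel.

    keypoints: array of shape (J, 2) holding (x, y) pixel coordinates.
    Returns an array of shape (J, height, width) with a peak of 1.0
    at each keypoint location.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((len(keypoints), height, width))
    for j, (x, y) in enumerate(keypoints):
        heatmaps[j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps
```

Stacking these per-frame heatmaps over time yields the video-like tensor that a spatiotemporal transformer can consume directly.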
📝 Abstract
This paper presents a self-supervised temporal video alignment framework which is useful for several fine-grained human activity understanding applications. In contrast with the state-of-the-art method of CASA, where sequences of 3D skeleton coordinates are taken directly as input, our key idea is to use sequences of 2D skeleton heatmaps as input. Unlike CASA which performs self-attention in the temporal domain only, we feed 2D skeleton heatmaps to a video transformer which performs self-attention both in the spatial and temporal domains for extracting effective spatiotemporal and contextual features. In addition, we introduce simple heatmap augmentation techniques based on 2D skeletons for self-supervised learning. Despite the lack of 3D information, our approach achieves not only higher accuracy but also better robustness against missing and noisy keypoints than CASA. Furthermore, extensive evaluations on three public datasets, i.e., Penn Action, IKEA ASM, and H2O, demonstrate that our approach outperforms previous methods in different fine-grained human activity understanding tasks. Finally, fusing 2D skeleton heatmaps with RGB videos yields the state-of-the-art on all metrics and datasets. To our best knowledge, our work is the first to utilize 2D skeleton heatmap inputs and the first to explore multi-modality fusion for temporal video alignment.
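The abstract mentions "simple heatmap augmentation techniques" for self-supervised learning, and robustness to missing keypoints. A hedged sketch of what such an augmentation could look like — one flip decision shared across the whole sequence to preserve temporal consistency, plus per-joint channel dropout to simulate missing keypoints; the exact augmentations and probabilities here are assumptions, not the paper's recipe:

```python
import numpy as np

def augment_heatmap_sequence(seq, rng, p_flip=0.5, p_drop=0.1):
    """seq: heatmap sequence of shape (T, J, H, W).

    Draws one horizontal-flip decision for the entire sequence (so all
    frames stay temporally consistent), then zeroes out randomly chosen
    joint channels to mimic missing keypoints.
    """
    seq = seq.copy()
    if rng.random() < p_flip:
        seq = seq[..., ::-1]                      # mirror the width axis
    drop = rng.random(seq.shape[1]) < p_drop      # per-joint dropout mask
    seq[:, drop] = 0.0
    return seq
```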
Problem

Research questions and friction points this paper is trying to address.

Self-supervised alignment of 2D skeleton sequences for activity understanding
Enhancing robustness against missing and noisy keypoints in skeleton data
Fusing 2D skeleton heatmaps with RGB videos for improved performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 2D skeleton heatmaps as input for alignment
Applies spatiotemporal self-attention via video transformer
Fuses 2D skeleton heatmaps with RGB videos
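The key architectural difference from CASA highlighted above is attention over both space and time rather than time alone. A toy single-head sketch of the idea — flattening frame and patch axes into one token axis so every token attends across both domains; the shapes and single-head formulation are simplifying assumptions, not the paper's architecture:

```python
import numpy as np

def spatiotemporal_self_attention(x, wq, wk, wv):
    """x: tokens of shape (T, P, D) for T frames and P spatial patches.

    Flattening time and space into one token axis of length T*P lets
    each patch attend to all patches in all frames, unlike temporal-only
    self-attention over per-frame feature vectors.
    """
    t, p, d = x.shape
    tokens = x.reshape(t * p, d)
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)     # softmax over all T*P tokens
    return (attn @ v).reshape(t, p, -1)
```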
Quoc-Huy Tran
Retrocausal, Inc.
Video Understanding · Action Recognition · 3D Perception · Autonomous Driving
Muhammad Ahmed
Retrocausal, Inc., Redmond, WA
Murad Popattia
Retrocausal, Inc., Redmond, WA
M. Hassan Ahmed
Retrocausal, Inc., Redmond, WA
Andrey Konin
Retrocausal, Inc., Redmond, WA
M. Zeeshan Zia
Retrocausal, Inc., Redmond, WA