A Multi-Modal Framework with Cross-Subject Pseudo-Labeling and Semantic Alignment for Micro-Gesture Recognition

📅 2026-06-11

🤖 AI Summary

This work addresses three challenges in micro-gesture recognition from untrimmed videos: low signal-to-noise ratio, long-tailed class distribution, and cross-subject domain shift. It proposes a multimodal fusion framework that jointly models 68-keypoint skeleton coordinates, 3D heatmap volumes, and high-resolution RGB features. The approach introduces a cross-modal pseudo-labeling strategy for unsupervised domain adaptation, combines an orthogonal semantic embedding loss with a square-root smoothed weighting mechanism to strengthen the representation of tail classes, and applies temperature-scaled soft voting to mitigate overconfidence during late fusion. Evaluated on Track 1 of the 4th MiGA-IJCAI Challenge, the method achieves an F1-score of 68.13%, ranking fourth.
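The temperature-scaled soft voting mentioned above can be illustrated with a minimal sketch. It assumes each modality produces a vector of class logits and that fusion is a weighted average of temperature-softened probabilities; the function names, the equal default weights, and the example logits are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_vote(per_modality_logits, temperature=2.0, weights=None):
    """Average temperature-softened probabilities across modalities.

    Assumption: equal modality weights unless `weights` is given.
    """
    n = len(per_modality_logits)
    if weights is None:
        weights = [1.0 / n] * n
    num_classes = len(per_modality_logits[0])
    fused = [0.0] * num_classes
    for w, logits in zip(weights, per_modality_logits):
        probs = softmax(logits, temperature)
        for c in range(num_classes):
            fused[c] += w * probs[c]
    return fused

# Hypothetical logits for three modalities (skeleton, heatmap, RGB).
example_logits = [
    [9.0, 1.0, 0.0],  # one very confident modality
    [1.0, 2.0, 0.5],
    [0.8, 2.2, 0.4],
]
sharp = softmax(example_logits[0], temperature=1.0)
soft = softmax(example_logits[0], temperature=4.0)
fused = soft_vote(example_logits, temperature=4.0)
```

A temperature above 1 flattens each modality's distribution, so a single overconfident model contributes a smaller peak to the average. The temperature value itself would normally be tuned on held-out data.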

📝 Abstract

Micro-gestures (MGs) are spontaneous and subtle body movements that frequently convey hidden human emotions. Recognizing MGs in untrimmed videos remains highly challenging due to their extremely low signal-to-noise ratio, severe long-tailed class distribution, and the inherent domain shift encountered in cross-subject evaluation scenarios. In this paper, we propose a comprehensive multi-modal framework for Track 1 of the 4th MiGA-IJCAI Challenge. To capture fine-grained representations, we design a saliency-guided multi-modal extraction pipeline integrating 68-keypoint skeleton joint coordinates, 3D heatmap volumes, and high-resolution RGB visual features. We introduce a gentle square-root smoothed weighting mechanism paired with an Orthogonal Semantic Embedding Loss to protect tail classes without compromising overall recognition capabilities. More importantly, to bridge the cross-subject generalization gap, we propose a Cross-Modal Pseudo-Labeling (CMPL) strategy for unsupervised domain adaptation, which significantly boosts single-modal robustness. Finally, a temperature-scaled soft-voting mechanism is used to alleviate overconfidence during late fusion. Extensive experiments demonstrate that our framework achieves a competitive F1-score of 68.13%, securing 4th place.
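To make the idea of square-root smoothed weighting concrete, the sketch below compares it with plain inverse-frequency weighting. It assumes the weight for a class is proportional to one over the square root of its sample count, normalized so the mean weight is 1; the abstract does not give the exact formula, so this form, the function names, and the class counts are illustrative assumptions.

```python
import math

def sqrt_smoothed_weights(class_counts):
    """Class weights proportional to 1 / sqrt(count), with mean weight 1.

    Assumption: every count is positive. This is one common form of
    square-root smoothing, not necessarily the paper's exact formula.
    """
    inv_sqrt = [1.0 / math.sqrt(n) for n in class_counts]
    scale = len(class_counts) / sum(inv_sqrt)
    return [w * scale for w in inv_sqrt]

def inverse_frequency_weights(class_counts):
    """Plain inverse-frequency weights, with mean weight 1, for comparison."""
    inv = [1.0 / n for n in class_counts]
    scale = len(class_counts) / sum(inv)
    return [w * scale for w in inv]

# Hypothetical long-tailed counts: one head class, one tail class.
counts = [10000, 100]
sqrt_w = sqrt_smoothed_weights(counts)
inv_w = inverse_frequency_weights(counts)

# Ratio of tail weight to head weight under each scheme.
sqrt_ratio = sqrt_w[1] / sqrt_w[0]
inv_ratio = inv_w[1] / inv_w[0]
```

With a 100:1 imbalance, inverse-frequency weighting up-weights the tail class by a factor of 100, while the square-root form up-weights it by a factor of 10. That gentler ratio is the sense in which such a scheme can protect tail classes without letting them dominate the loss.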

Problem

Research questions and friction points this paper is trying to address.

micro-gesture recognition

cross-subject evaluation

long-tailed class distribution

domain shift

low signal-to-noise ratio

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Pseudo-Labeling

Orthogonal Semantic Embedding Loss

Saliency-Guided Multi-Modal Fusion
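As a rough illustration of Cross-Modal Pseudo-Labeling, the sketch below selects pseudo-labels for unlabeled target-subject clips from the averaged predictions of other modalities, keeping only confident ones. The summary does not describe the selection rule, so the averaging step, the 0.9 confidence threshold, the function name, and the example probabilities are all assumptions made for illustration.

```python
def cross_modal_pseudo_labels(teacher_probs, threshold=0.9):
    """Pick confident pseudo-labels from other modalities' predictions.

    teacher_probs: one entry per unlabeled sample; each entry is a list of
    class-probability lists, one per teacher modality.
    Returns a list of (sample_index, label) pairs.
    Assumption: teachers are averaged and filtered by a fixed threshold.
    """
    selected = []
    for i, per_modality in enumerate(teacher_probs):
        num_classes = len(per_modality[0])
        mean = [
            sum(p[c] for p in per_modality) / len(per_modality)
            for c in range(num_classes)
        ]
        label = max(range(num_classes), key=mean.__getitem__)
        if mean[label] >= threshold:
            selected.append((i, label))
    return selected

# Hypothetical predictions from two teacher modalities on three clips.
unlabeled = [
    [[0.95, 0.03, 0.02], [0.91, 0.05, 0.04]],  # teachers agree, confident
    [[0.50, 0.40, 0.10], [0.30, 0.60, 0.10]],  # teachers disagree
    [[0.02, 0.02, 0.96], [0.05, 0.05, 0.90]],  # teachers agree, confident
]
pseudo = cross_modal_pseudo_labels(unlabeled, threshold=0.9)
```

In a setup like this, the selected pairs would then serve as training targets for fine-tuning a different single-modality model on the unseen subjects, which is one plausible way pseudo-labels could narrow the cross-subject gap.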