A Multi-Modal Framework with Cross-Subject Pseudo-Labeling and Semantic Alignment for Micro-Gesture Recognition

📅 2026-06-11
🤖 AI Summary
This work addresses the challenges of low signal-to-noise ratio, long-tailed class distribution, and cross-subject domain shift in micro-gesture recognition from untrimmed videos by proposing a multimodal fusion framework that jointly models 68-point skeletal coordinates, 3D heatmaps, and high-resolution RGB features. The approach introduces a novel cross-modal pseudo-labeling strategy for unsupervised domain adaptation, incorporates an orthogonal semantic embedding loss together with a square-root smoothed weighting mechanism to enhance representation of tail classes, and employs temperature-scaled soft voting to mitigate overconfidence in multimodal fusion. Evaluated on the MiGA-IJCAI Challenge Track 1, the method achieves an F1-score of 68.13%, ranking fourth among competing approaches.
📝 Abstract
Micro-gestures (MGs) are spontaneous and subtle body movements that frequently convey hidden human emotions. Recognizing MGs in untrimmed videos remains highly challenging due to their extremely low signal-to-noise ratio, severe long-tailed class distribution, and the inherent domain shift encountered in cross-subject evaluation scenarios. In this paper, we propose a comprehensive multi-modal framework for Track 1 of the 4th MiGA-IJCAI Challenge. To capture fine-grained representations, we design a saliency-guided multi-modal extraction pipeline integrating 68-keypoint skeleton joint coordinates, 3D heatmap volumes, and high-resolution RGB visual features. We introduce a gentle square-root smoothed weighting mechanism paired with an Orthogonal Semantic Embedding Loss to protect tail classes without compromising overall recognition capabilities. More importantly, to bridge the cross-subject generalization gap, we propose a Cross-Modal Pseudo-Labeling (CMPL) strategy for unsupervised domain adaptation, which significantly boosts single-modal robustness. Finally, a temperature-scaled soft-voting mechanism is used to alleviate overconfidence during late fusion. Extensive experiments demonstrate that our framework achieves a competitive F1-score of 68.13%, securing 4th place.
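The abstract names two mechanisms that are easy to illustrate in isolation: square-root smoothed class weighting and temperature-scaled soft voting. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the weights are proportional to 1/sqrt(class count) and normalized to mean 1, and that late fusion averages per-modality softmax probabilities computed at a shared temperature; the function names, the normalization, and the default temperature are hypothetical choices made for this example.

```python
import math


def sqrt_smoothed_weights(class_counts):
    """Per-class loss weights proportional to 1/sqrt(count), scaled to mean 1.

    Assumption: this is one common reading of "square-root smoothed"
    re-weighting; the paper's exact formula may differ.
    """
    raw = [1.0 / math.sqrt(n) for n in class_counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_modality_logits, temperature=2.0):
    """Average temperature-softened probabilities across modalities.

    Returns (predicted class index, fused probability vector).
    The default temperature is an illustrative value, not one from the paper.
    """
    probs = [softmax(logits, temperature) for logits in per_modality_logits]
    n_classes = len(probs[0])
    fused = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return fused.index(max(fused)), fused


if __name__ == "__main__":
    # Hypothetical long-tailed class counts: head, medium, tail.
    weights = sqrt_smoothed_weights([900, 100, 4])
    print("class weights:", [round(w, 3) for w in weights])

    # Hypothetical logits from skeleton, heatmap, and RGB branches.
    logits = [[8.0, 1.0, 0.0], [1.0, 2.5, 0.0], [1.0, 2.5, 0.0]]
    label, fused = soft_vote(logits, temperature=2.0)
    print("fused prediction:", label, [round(p, 3) for p in fused])
```

With counts of 900 and 4, plain inverse-frequency weighting would give the tail class 225 times the weight of the head class, while the square-root variant gives it 15 times, which is the sense in which the weighting is "gentle". Dividing logits by a temperature above 1 flattens each modality's distribution before averaging, so a single overconfident branch contributes less extreme probabilities to the vote.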
Problem

Research questions and friction points this paper is trying to address.

micro-gesture recognition
cross-subject evaluation
long-tailed class distribution
domain shift
low signal-to-noise ratio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Pseudo-Labeling
Orthogonal Semantic Embedding Loss
Saliency-Guided Multi-Modal Fusion
Long-Tailed Recognition
Unsupervised Domain Adaptation
Haoran Zhang
School of Computer Science and Information Engineering, Hefei University of Technology (HFUT), Hefei, China
Haokun Zhang
School of Computer Science, University of Auckland (UOA), Auckland, New Zealand
Pengyu Liu
School of Computer Science and Information Engineering, Hefei University of Technology (HFUT), Hefei, China
Yujia Zhang
School of Computer Science and Information Engineering, Hefei University of Technology (HFUT), Hefei, China
Weibao Xue
School of Computer Science and Information Engineering, Hefei University of Technology (HFUT), Hefei, China
Yanbin Hao
Hefei University of Technology
Video retrieval, video action recognition, hashing, video hyperlinking