Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition

📅 2026-06-08
🤖 AI Summary
This work addresses fine-grained micro-gesture recognition under limited labeled data by proposing a multimodal ensemble framework that validates the effectiveness of self-supervised RGB representation learning for this task. The approach integrates a self-supervised RGB model, pretrained via masked video modeling (MVM) on 120K unlabeled clips and then fine-tuned on the iMiGUE dataset, with existing supervised multi-stream architectures. On its own, the RGB model achieves 69.224% top-1 accuracy on the iMiGUE test set; adding it to the ensemble as a complementary branch raises this to 74.419%, surpassing the previous state of the art by 1.206 percentage points.
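To make the masked video modeling step more concrete, here is a minimal sketch of how a mask over video patches might be drawn before reconstruction. The summary does not state the masking scheme, so the tube-style mask (one spatial pattern shared by all frames), the 90% ratio, the 14x14 patch grid, and the function name `tube_mask` are all illustrative assumptions rather than the authors' configuration.

```python
import numpy as np


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, rng):
    """Return a boolean mask of shape (num_frames, grid_h, grid_w).

    True marks a patch hidden from the encoder. The same spatial pattern is
    repeated across all frames (a "tube"), which is an assumed design choice.
    """
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))

    # Pick which spatial positions to hide, without replacement.
    spatial = np.zeros(num_patches, dtype=bool)
    hidden = rng.choice(num_patches, size=num_masked, replace=False)
    spatial[hidden] = True

    # Repeat the spatial pattern along the time axis.
    spatial = spatial.reshape(1, grid_h, grid_w)
    return np.broadcast_to(spatial, (num_frames, grid_h, grid_w)).copy()


# Hypothetical settings: 8 temporal slots, a 14x14 patch grid, 90% masked.
mask = tube_mask(8, 14, 14, 0.9, np.random.default_rng(0))
visible_per_frame = int((~mask[0]).sum())
```

During pretraining, only the visible patches would be encoded and the model would be trained to reconstruct the hidden ones, which is what lets unlabeled clips supply a training signal.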
📝 Abstract
In this paper, we present XInsight Lab's solution to the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, in which our solution ranked first and achieved a new state-of-the-art result. We propose a multimodal ensemble framework that integrates a self-supervised RGB-based model with supervised multi-stream models from previous solutions. The self-supervised RGB model is pretrained on 120K unlabeled clips via masked video modeling and then fine-tuned on iMiGUE. This simple yet effective RGB baseline achieves 69.224% top-1 accuracy on the iMiGUE test set, demonstrating the benefit of learning transferable representations from unlabeled in-domain videos. By incorporating this model as a complementary branch, the final ensemble reaches 74.419% top-1 accuracy, surpassing the previous state of the art by 1.206 percentage points. Experimental results on iMiGUE, including ablation studies on the ensemble strategy, validate the effectiveness of self-supervised RGB representation learning for micro-gesture recognition.
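The abstract does not spell out the ensemble strategy, so the following is only a sketch of one common option: weighted late fusion of per-model class probabilities. The function names, the equal weights, and the toy logits are hypothetical and do not come from the paper.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def late_fusion(logits_per_model, weights):
    """Weighted average of class probabilities from several models.

    logits_per_model: list of arrays, each of shape (num_samples, num_classes)
    weights: one non-negative weight per model (normalized inside)
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    probs = np.stack([softmax(l) for l in logits_per_model])  # (M, N, C)
    return np.tensordot(weights, probs, axes=1)  # (N, C)


def top1_accuracy(probs, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    return float((probs.argmax(axis=1) == labels).mean())


# Toy example: each branch is confident on one sample and slightly wrong on the other.
labels = np.array([0, 1])
rgb_logits = np.array([[4.0, 0.0], [0.5, 0.0]])       # stands in for the RGB branch
skeleton_logits = np.array([[0.0, 0.5], [0.0, 4.0]])  # stands in for another stream

fused = late_fusion([rgb_logits, skeleton_logits], weights=[1.0, 1.0])
acc_rgb = top1_accuracy(softmax(rgb_logits), labels)
acc_skeleton = top1_accuracy(softmax(skeleton_logits), labels)
acc_fused = top1_accuracy(fused, labels)
```

In this toy case each branch alone is right on one of the two samples, while the fused prediction is right on both, which illustrates why a complementary branch can help even when it is not the strongest single model. As a check on the reported numbers, a 1.206-point margin over a final accuracy of 74.419% implies a previous best of 73.213%.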
Problem

Research questions and friction points this paper is trying to address.

micro-gesture recognition
self-supervised learning
unlabeled video data
transferable representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised learning
masked video modeling
micro-gesture recognition
multimodal ensemble
transferable representations