Motion Reinforces Appearance: RGB-Skeleton Gated Residual Fusion for Micro-Gesture Online Recognition

📅 2026-06-10
🤖 AI Summary
This work addresses the challenge of insufficient single-modality information in online micro-gesture recognition, where subtle movements and high spontaneity hinder accurate detection. To this end, the authors propose DyFADet+, a dual-stream RGB-skeleton network that enables precise temporal localization and classification of micro-gesture instances in untrimmed videos. The method maps both modalities into a shared multi-scale temporal embedding space and introduces a gated residual fusion module that adaptively injects skeletal motion cues into RGB representations, replacing conventional feature concatenation. Coupled with a dynamic temporal action detection head, the framework supports online boundary regression and classification. Evaluated on the SMG dataset, the approach achieves an F1 score of 40.88, securing second place in the online micro-gesture recognition track of the IJCAI 2026 EI-MiGA Challenge.
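To make the reported F1 score concrete, the sketch below shows one common way to score localized instances: a prediction counts as a true positive when its label matches a ground-truth instance and their temporal IoU reaches a threshold. The threshold of 0.3, the greedy one-to-one matching, and the gesture label names are assumptions for illustration and are not taken from the paper or the challenge rules.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def f1_score(preds, gts, iou_thr=0.3):
    """F1 over (start, end, label) instances with greedy one-to-one matching.

    iou_thr=0.3 is an assumed threshold, not a value from the source.
    """
    matched = set()  # indices of ground-truth instances already claimed
    tp = 0
    for p_start, p_end, p_label in preds:
        for i, (g_start, g_end, g_label) in enumerate(gts):
            if i in matched or p_label != g_label:
                continue
            if temporal_iou((p_start, p_end), (g_start, g_end)) >= iou_thr:
                matched.add(i)
                tp += 1
                break
    fp = len(preds) - tp
    fn = len(gts) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical predictions and annotations: (start, end, label).
preds = [
    (1.0, 2.0, "touch_face"),
    (5.0, 6.0, "cross_arms"),
    (8.0, 9.0, "touch_face"),
]
gts = [
    (1.1, 2.1, "touch_face"),
    (5.5, 7.5, "cross_arms"),
]
score = f1_score(preds, gts)
```

In this toy case only the first prediction overlaps its ground-truth instance enough to match, so the two remaining predictions count as false positives and the second annotation as a miss.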
📝 Abstract
Micro-gesture analysis is attracting increasing attention as a way to infer spontaneous emotion from subtle body movements. Micro-gesture online recognition, which localizes and classifies each gesture instance in untrimmed videos, is a core task in the 4th EI-MiGA-IJCAI Challenge. Compared with typical temporal action detection, this task places particular emphasis on both localizing and classifying actions, requiring the model to output the start time, end time, and category of each micro-gesture. Moreover, because micro-gestures are highly spontaneous, a single modality rarely captures complete and accurate cues. In this work, we propose DyFADet+, which extends DyFADet into a dual-stream RGB-skeleton framework. Both modalities are projected into shared multi-scale temporal embeddings and fused through a gated residual module, which adaptively injects skeleton motion into the RGB representation rather than relying on naive concatenation. The fused features are then decoded by a Dynamic TAD head for online classification and boundary regression. On the SMG dataset, our method achieves an F1 score of 40.88, ranking 2nd in the Micro-gesture Online Recognition track.
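The gated residual idea can be illustrated with a minimal NumPy sketch. It assumes both streams have already been projected to the same shape `(T, C)` at one temporal scale, and that the gate is a sigmoid over a linear map of the concatenated features; the function name, the parameter shapes, and this particular gate form are assumptions for illustration, not the formulation used in DyFADet+.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_residual_fusion(rgb, skel, w_gate, b_gate):
    """Add gated skeleton features to the RGB features as a residual.

    rgb, skel: (T, C) features at one temporal scale (assumed aligned).
    w_gate:    (2C, C) gate weights; b_gate: (C,) gate bias.
    """
    # The gate looks at both modalities and yields a per-channel weight in (0, 1).
    gate = sigmoid(np.concatenate([rgb, skel], axis=-1) @ w_gate + b_gate)
    # The RGB features pass through unchanged; skeleton motion is injected on top.
    return rgb + gate * skel


rng = np.random.default_rng(0)
T, C = 8, 4  # hypothetical sequence length and channel count
rgb = rng.standard_normal((T, C))
skel = rng.standard_normal((T, C))
w_gate = rng.standard_normal((2 * C, C)) * 0.1
b_gate = np.zeros(C)

fused = gated_residual_fusion(rgb, skel, w_gate, b_gate)
```

Unlike concatenation, this form keeps the output the same shape as the RGB stream and falls back to the RGB features alone when the gate closes, which is one plausible reading of "motion reinforces appearance".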
Problem

Research questions and friction points this paper is trying to address.

micro-gesture
online recognition
multimodal fusion
temporal action detection
spontaneous emotion
Innovation

Methods, ideas, or system contributions that make the work stand out.

gated residual fusion
RGB-skeleton fusion
micro-gesture recognition
online temporal action detection
multi-modal embedding
Jialin Liu
School of Computer Science and Information Engineering, Hefei University of Technology (HFUT), Hefei, China
Xinwen He
School of Computer Science and Information Engineering, Hefei University of Technology (HFUT), Hefei, China
Pengyu Liu
School of Computer Science and Information Engineering, Hefei University of Technology (HFUT), Hefei, China
Jiale Shi
School of Computer Science and Information Engineering, Hefei University of Technology (HFUT), Hefei, China
Huaijuan Zang
School of Computer Science and Information Engineering, Hefei University of Technology (HFUT), Hefei, China
Yanbin Hao
Hefei University of Technology
Video retrieval, video action recognition, hashing, video hyperlinking