Motion Reinforces Appearance: RGB-Skeleton Gated Residual Fusion for Micro-Gesture Online Recognition

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of insufficient single-modality information in online micro-gesture recognition, where subtle movements and high spontaneity hinder accurate detection. To this end, the authors propose DyFADet+, a dual-stream RGB-skeleton network that enables precise temporal localization and classification of micro-gesture instances in untrimmed videos. The method maps both modalities into a shared multi-scale temporal embedding space and introduces a gated residual fusion module that adaptively injects skeletal motion cues into RGB representations, replacing conventional feature concatenation. Coupled with a dynamic temporal action detection head, the framework supports online boundary regression and classification. Evaluated on the SMG dataset, the approach achieves an F1 score of 40.88, securing second place in the online micro-gesture recognition track of the IJCAI 2026 EI-MiGA Challenge.
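To make the fusion idea concrete, the following is a minimal NumPy sketch of one common form of gated residual fusion, in which a sigmoid gate decides per time step and channel how much skeleton signal is added on top of the RGB features. It is an illustration under assumptions, not the authors' implementation: the function name `gated_residual_fusion`, the parameters `w_gate`, `b_gate`, and `w_proj`, the feature sizes, and the exact form of the gate are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_residual_fusion(rgb, skel, w_gate, b_gate, w_proj):
    """Inject skeleton features into RGB features through a learned gate.

    rgb, skel : (T, C) temporal features, assumed to share one embedding space.
    w_gate    : (2C, C) gate weights (hypothetical parameterization).
    b_gate    : (C,) gate bias.
    w_proj    : (C, C) projection applied to the skeleton stream.
    """
    # The gate sees both modalities and outputs values in (0, 1).
    gate = sigmoid(np.concatenate([rgb, skel], axis=-1) @ w_gate + b_gate)
    # Residual form: RGB passes through unchanged, skeleton is added on top.
    return rgb + gate * (skel @ w_proj)


# Toy inputs: 8 time steps, 4 channels (sizes chosen only for illustration).
rng = np.random.default_rng(0)
T, C = 8, 4
rgb = rng.normal(size=(T, C))
skel = rng.normal(size=(T, C))
w_gate = rng.normal(scale=0.1, size=(2 * C, C))
b_gate = np.zeros(C)
w_proj = rng.normal(scale=0.1, size=(C, C))

fused = gated_residual_fusion(rgb, skel, w_gate, b_gate, w_proj)

# With a strongly negative bias the gate closes and the output falls back to RGB.
closed = gated_residual_fusion(rgb, skel, w_gate, np.full(C, -50.0), w_proj)
```

Unlike concatenation, the residual form keeps the output the same size as the RGB features, and a closed gate leaves the RGB representation unchanged, so the skeleton stream can only add to it.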

📝 Abstract

Micro-gesture analysis is attracting increasing attention as a way to infer spontaneous emotion from subtle body movements. Micro-gesture online recognition, which localizes and classifies each gesture instance in untrimmed videos, is a core task in the 4th EI-MiGA-IJCAI Challenge. Compared with typical temporal action detection, micro-gesture online recognition emphasizes both the localization and the classification of actions: the model must output the start time, end time, and category of each micro-gesture. Moreover, because micro-gestures are highly spontaneous, a single modality rarely captures complete and accurate cues. In this work, we propose DyFADet+, which extends DyFADet into a dual-stream RGB-skeleton framework. Both modalities are projected into shared multi-scale temporal embeddings and fused through a gated residual module, which adaptively injects skeleton motion into the RGB representation instead of relying on naive concatenation. The fused features are then decoded by a Dynamic TAD head for online classification and boundary regression. On the SMG dataset, our method achieves an F1 score of 40.88, ranking 2nd in the Micro-gesture Online Recognition track.
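Since the task output is a set of (start time, end time, category) triples scored by F1, a small sketch of how such predictions might be matched against ground truth can help. The code below is an assumption-laden illustration rather than the challenge's official protocol: the helper names `temporal_iou` and `detection_f1`, the greedy one-to-one matching, and the IoU threshold of 0.3 are all hypothetical choices.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end, ...) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def detection_f1(preds, gts, iou_thr=0.3):
    """F1 over (start, end, label) triples with greedy one-to-one matching.

    A prediction counts as a true positive when it has the same label as an
    unmatched ground-truth instance and overlaps it with IoU >= iou_thr
    (the threshold value is an assumption, not taken from the source).
    """
    matched = set()
    tp = 0
    for p in preds:
        best_j, best_iou = None, iou_thr
        for j, g in enumerate(gts):
            if j in matched or g[2] != p[2]:
                continue
            iou = temporal_iou(p, g)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp  # predictions with no matching instance
    fn = len(gts) - tp    # ground-truth instances that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example with made-up segments (seconds) and labels.
gts = [(1.0, 2.0, "a"), (5.0, 6.5, "b")]
preds = [(1.1, 2.1, "a"), (5.0, 6.0, "a"), (9.0, 9.5, "b")]
f1 = detection_f1(preds, gts)
```

In this toy case only the first prediction matches: the second has the wrong label and the third does not overlap any instance, so both a correct boundary and a correct category are needed for a detection to count.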

Problem

Research questions and friction points this paper is trying to address.

micro-gesture

online recognition

multimodal fusion

temporal action detection

spontaneous emotion

Innovation

Methods, ideas, or system contributions that make the work stand out.

gated residual fusion

RGB-skeleton fusion

micro-gesture recognition

online temporal action detection

multi-modal embedding

