🤖 AI Summary
This work addresses online micro-gesture recognition, where gestures are extremely short, low in amplitude, and visually ambiguous, which makes discriminative spatiotemporal representations hard to capture. The authors propose a lightweight spatiotemporal decoupled adapter that splits video adaptation into separate spatial and temporal branches, aiming to preserve fine-grained patterns that joint single-branch modeling may miss. They also introduce an adaptive soft-balancing data augmentation strategy that adjusts augmentation intensity by class rarity and learning difficulty without handcrafted thresholds, targeting the long-tailed data distribution. Built on parameter-efficient fine-tuning, the method took first place in Track 2 of the 4th EI-MiGA-IJCAI Challenge with an F1 score of 0.43808.
📝 Abstract
Micro-gesture online recognition aims to temporally localize and classify subtle gestures in untrimmed videos. Owing to their extremely short duration, low motion amplitude, and ambiguous visual cues, capturing discriminative spatiotemporal representations remains highly challenging. Existing parameter-efficient adapters typically employ a single branch to model spatial and temporal cues jointly, which may fail to capture the fine-grained patterns of micro-gestures. To address this limitation, we propose a Spatial-Temporal Decoupled Adapter that decomposes video adaptation into independent temporal and spatial branches via lightweight depthwise convolutions. In addition, to address the long-tail distribution problem in the benchmark dataset, we introduce Adaptive Soft Balanced Augmentation, which dynamically allocates augmentation intensity based on class rarity and learning difficulty, without manual thresholds. Our method achieves an F1 score of 0.43808, ranking 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.
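The abstract does not give the formula behind Adaptive Soft Balanced Augmentation, so the following is only a minimal sketch of the general idea: augmentation intensity that varies smoothly with class rarity and learning difficulty instead of switching on at a fixed threshold. The function name `augmentation_intensity`, the log-frequency rarity score, the use of mean per-class loss as a difficulty proxy, and the equal-weight average are all assumptions made for illustration, not the authors' method.

```python
import math


def augmentation_intensity(class_counts, class_losses, max_intensity=1.0):
    """Hypothetical soft-balancing rule (illustrative only, not the paper's formula).

    class_counts: dict mapping class name -> number of training samples (> 0)
    class_losses: dict mapping class name -> recent mean training loss (>= 0)
    Returns a dict mapping class name -> intensity in [0, max_intensity].
    """
    # Rarity: log-frequency gap to the most frequent class, scaled to [0, 1].
    max_count = max(class_counts.values())
    log_gap = {c: math.log(max_count / n) for c, n in class_counts.items()}
    max_gap = max(log_gap.values()) or 1.0  # avoid dividing by zero if balanced
    rarity = {c: gap / max_gap for c, gap in log_gap.items()}

    # Difficulty: per-class loss relative to the hardest class, in [0, 1].
    max_loss = max(class_losses.values()) or 1.0
    difficulty = {c: class_losses[c] / max_loss for c in class_counts}

    # Smooth combination: no cut-off separates "head" from "tail" classes.
    return {
        c: max_intensity * 0.5 * (rarity[c] + difficulty[c])
        for c in class_counts
    }


# Toy long-tailed example with made-up counts and losses.
counts = {"head": 1000, "mid": 100, "tail": 10}
losses = {"head": 0.2, "mid": 0.4, "tail": 0.8}
intensity = augmentation_intensity(counts, losses)
```

In this toy setting the rare, high-loss class receives the strongest augmentation and the frequent, low-loss class the weakest, with intermediate classes falling in between. In a training loop, the losses would be refreshed periodically so that intensities track how well each class is currently being learned.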