🤖 AI Summary
This work addresses the challenge of micro-gesture recognition, where subtle inter-class differences hinder effective modeling of local dynamic features under conventional class-level supervision. To overcome this limitation, the authors propose a fine-grained semantic-guided learning framework, introducing the first micro-gesture dataset annotated with four-dimensional fine-grained textual descriptions. They design a multi-level contrastive optimization strategy to enable coarse-to-fine joint training and incorporate two novel attention modules, Fine-Grained Semantic Attention (FG-SA) and Class-Level Prototype Attention (CP-A), to guide vision-language models toward discriminative local motion cues. Experimental results demonstrate that the proposed approach achieves competitive performance on micro-gesture recognition benchmarks, validating the efficacy of fine-grained semantic guidance in enhancing recognition accuracy.
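To make the idea of semantic-guided attention concrete, the sketch below shows one plausible form such a module could take: fine-grained text embeddings act as queries that attend over visual tokens, so the textual description of local motion steers which spatio-temporal features are emphasized. This is a minimal illustration, not the paper's implementation; the class name `FineGrainedSemanticAttention`, the single-head design, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class FineGrainedSemanticAttention(nn.Module):
    """Hypothetical sketch of FG-SA-style guidance: text embeddings of
    fine-grained motion descriptions query the visual token sequence,
    producing one semantically guided visual summary per text cue."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # text embeddings -> queries
        self.k_proj = nn.Linear(dim, dim)  # visual tokens   -> keys
        self.v_proj = nn.Linear(dim, dim)  # visual tokens   -> values
        self.scale = dim ** -0.5

    def forward(self, text_emb: torch.Tensor, vis_tokens: torch.Tensor) -> torch.Tensor:
        # text_emb:   (B, T_txt, D) fine-grained description embeddings
        # vis_tokens: (B, N, D)     spatio-temporal visual tokens
        q = self.q_proj(text_emb)
        k = self.k_proj(vis_tokens)
        v = self.v_proj(vis_tokens)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, T_txt, N)
        attn = attn.softmax(dim=-1)
        return attn @ v  # (B, T_txt, D) text-guided visual features
```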
📄 Abstract
Micro-gesture recognition (MGR) is challenging due to subtle inter-class variations. Existing methods rely on category-level supervision, which is insufficient for capturing subtle and localized motion differences. This paper therefore proposes a Fine-Grained Semantic Guidance Learning (FG-SGL) framework that jointly integrates fine-grained and category-level semantics to guide vision-language models in perceiving local MG motions. The framework comprises two attention modules: Fine-Grained Semantic Attention (FG-SA), which adopts fine-grained semantic cues to guide the learning of local motion features, and Class-Level Prototype Attention (CP-A), which enhances the separability of MG features through category-level semantic guidance. To support fine-grained semantic guidance, this work constructs a fine-grained textual dataset with human annotations that describes the dynamic process of MGs along four refined semantic dimensions. Furthermore, a Multi-Level Contrastive Optimization strategy is designed to jointly optimize both modules in a coarse-to-fine manner. Experiments show that FG-SGL achieves competitive performance, validating the effectiveness of fine-grained semantic guidance for MGR.
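As a rough illustration of what a coarse-to-fine contrastive objective could look like, the sketch below combines a class-level and a fine-grained video-text InfoNCE term. The symmetric formulation, the temperature `tau`, and the weighting `lam` are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_level_contrastive_loss(vid, coarse_txt, fine_txt, tau=0.07, lam=0.5):
    """Illustrative multi-level contrastive objective: each video embedding
    is pulled toward both its class-level text embedding (coarse) and its
    fine-grained description embedding, via symmetric InfoNCE.

    vid, coarse_txt, fine_txt: (B, D) embeddings, where row i of each
    tensor corresponds to the same sample."""
    vid = F.normalize(vid, dim=-1)

    def info_nce(txt):
        txt = F.normalize(txt, dim=-1)
        logits = vid @ txt.t() / tau  # (B, B) cosine-similarity logits
        targets = torch.arange(vid.size(0), device=vid.device)
        # Symmetric loss: video-to-text and text-to-video directions.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    # Coarse (category-level) term plus weighted fine-grained term.
    return info_nce(coarse_txt) + lam * info_nce(fine_txt)
```

Under this reading, the coarse term keeps categories separable while the fine-grained term forces the video encoder to align with descriptions of local dynamics, which is the coarse-to-fine joint training the abstract describes.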