🤖 AI Summary
Micro-expression recognition remains challenging due to extremely short durations (<500 ms) and highly localized facial movements, which have kept recognition accuracy saturated near 50%, even for expert annotators. To address this, we propose TSFmicro, a novel framework featuring a parallel spatiotemporal fusion mechanism that explicitly models the semantic complementarity between spatial location ("where") and motion pattern ("how") within a high-dimensional feature space. TSFmicro synergistically integrates RetNet's capability for long-range temporal modeling with the Transformer's multi-scale spatiotemporal interaction, enabling joint representation learning and multimodal feature fusion for dynamic micro-expressions. Evaluated on three benchmark datasets (CASME II, SAMM, and MMEW), TSFmicro achieves significant improvements over existing state-of-the-art methods, boosting average accuracy by 6.2–9.8 percentage points and, for the first time, systematically surpassing the 50% recognition barrier.
📝 Abstract
When emotions are repressed, an individual's true feelings may be revealed through micro-expressions. Consequently, micro-expressions are regarded as a genuine source of insight into an individual's authentic emotions. However, the transient and highly localised nature of micro-expressions poses a significant challenge to their accurate recognition, with recognition accuracy as low as 50%, even for professionals. To address these challenges, it is necessary to explore dynamic micro-expression recognition (DMER) using multimodal fusion techniques, with special attention to the diverse fusion of temporal and spatial modal features. In this paper, we propose a novel Temporal and Spatial feature Fusion framework for DMER (TSFmicro). This framework integrates a Retention Network (RetNet) and a Transformer-based DMER network, with the objective of efficient micro-expression recognition through the capture and fusion of temporal and spatial relations. Meanwhile, we propose a novel parallel time-space fusion method from the perspective of modal fusion, which fuses spatio-temporal information in a high-dimensional feature space, producing complementary "where-how" relationships at the semantic level and providing richer semantic information for the model. Experimental results on three well-recognised micro-expression datasets demonstrate the superior performance of TSFmicro in comparison with other contemporary state-of-the-art methods.
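To make the parallel time-space fusion idea concrete, the following is a minimal numpy sketch, not the authors' implementation: a retention-style temporal branch (a stand-in for RetNet's decayed aggregation over frames) and an attention-style spatial branch (a stand-in for the Transformer branch) run in parallel on the same clip, and their outputs are concatenated in feature space, pairing the "where" and "how" summaries. All shapes, the decay constant, and the pooling choices here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input: a clip of T frames, each with N spatial patches of dim D.
# (Real micro-expression clips would be frame features from a backbone.)
T, N, D = 8, 16, 32
clip = rng.standard_normal((T, N, D))

def temporal_branch(x, decay=0.9):
    """Retention-style temporal summary ("how"): exponentially
    decayed average over frames, a toy stand-in for RetNet."""
    frames = x.mean(axis=1)                 # (T, D): pool over space
    weights = decay ** np.arange(T)[::-1]   # older frames decay more
    weights /= weights.sum()
    return weights @ frames                 # (D,)

def spatial_branch(x):
    """Attention-style spatial summary ("where"): softmax-weighted
    pooling over patches, a toy stand-in for the Transformer branch."""
    patches = x.mean(axis=0)                # (N, D): pool over time
    scores = patches @ patches.mean(axis=0) # (N,) relevance per patch
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    return attn @ patches                   # (D,)

# Parallel fusion: both branches see the same clip; their outputs are
# concatenated in the high-dimensional feature space ("where" + "how").
fused = np.concatenate([spatial_branch(clip), temporal_branch(clip)])
print(fused.shape)  # (64,)
```

The key design point this sketch mirrors is that the two branches are parallel rather than sequential: neither branch's output conditions the other, so the fused vector retains independent spatial and temporal semantics for a downstream classifier.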