Soften the Mask: Adaptive Temporal Soft Mask for Efficient Dynamic Facial Expression Recognition

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dynamic facial expression recognition (DFER) suffers from background noise and temporal semantic redundancy, which degrade both accuracy and efficiency. To address this, we propose a dual-branch collaborative framework: a supervised classification branch ensures discriminative feature learning, while a self-supervised reconstruction branch enforces spatiotemporal consistency. We introduce a novel learnable time-aware soft masking mechanism that jointly leverages class-agnostic and class-specific semantics to dynamically emphasize informative expression frames and suppress redundant temporal information. Further, our approach integrates spatiotemporal modeling via masked autoencoders, random binary hard masking for robustness, and parallel joint training. Evaluated on mainstream benchmarks including DFEW, our method reduces FLOPs by 37% while achieving state-of-the-art accuracy (e.g., 92.4% on DFEW), significantly improving the efficiency–accuracy trade-off.

📝 Abstract
Dynamic Facial Expression Recognition (DFER) facilitates the understanding of psychological intentions through non-verbal communication. Existing methods struggle to manage irrelevant information, such as background noise and redundant semantics, which impairs both efficiency and effectiveness. In this work, we propose a novel supervised temporal soft masked autoencoder network for DFER, namely AdaTosk, which integrates a parallel supervised classification branch with a self-supervised reconstruction branch. The self-supervised reconstruction branch applies a random binary hard mask to generate diverse training samples, encouraging meaningful feature representations in the visible tokens. Meanwhile, the classification branch employs an adaptive temporal soft mask to flexibly mask visible tokens based on their temporal significance. Its two key components, the class-agnostic and class-semantic soft masks, serve to enhance critical expression moments and reduce semantic redundancy over time. Extensive experiments conducted on widely used benchmarks demonstrate that AdaTosk markedly reduces computational costs compared with current state-of-the-art methods while still maintaining competitive performance.
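
The difference between the two masking styles named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the sigmoid weighting, the 0.75 mask ratio, and the tensor sizes are all assumptions chosen for illustration, and the random frame scores stand in for whatever learned significance scores AdaTosk actually produces.

```python
import numpy as np

def random_hard_mask(num_tokens, mask_ratio, rng):
    """Binary hard mask: True = token stays visible, False = token is dropped."""
    num_keep = int(round(num_tokens * (1.0 - mask_ratio)))
    keep_idx = rng.permutation(num_tokens)[:num_keep]
    visible = np.zeros(num_tokens, dtype=bool)
    visible[keep_idx] = True
    return visible

def temporal_soft_mask(frame_scores, temperature=1.0):
    """Soft mask: squash per-frame scores into weights in (0, 1) with a sigmoid.

    The sigmoid is an assumption; the source does not specify the mapping.
    """
    scores = np.asarray(frame_scores, dtype=float)
    return 1.0 / (1.0 + np.exp(-scores / temperature))

rng = np.random.default_rng(0)
T, N, D = 4, 6, 8  # frames, tokens per frame, feature dim (hypothetical sizes)
tokens = rng.normal(size=(T, N, D))

# Hard mask: each token is either kept or removed entirely.
visible = random_hard_mask(T * N, mask_ratio=0.75, rng=rng).reshape(T, N)

# Soft mask: every frame is kept, but scaled by its temporal weight.
frame_scores = rng.normal(size=T)  # placeholder for learned scores
weights = temporal_soft_mask(frame_scores)
soft_masked = tokens * weights[:, None, None]
```

A hard mask changes which tokens exist, whereas a soft mask only rescales them, which is why the latter can be learned end to end with a classification loss.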
Problem

Research questions and friction points this paper is trying to address.

Dynamic Facial Expression Recognition (DFER) struggles with irrelevant information like background noise.
Existing DFER methods face challenges in balancing efficiency and effectiveness.
AdaTosk aims to reduce computational costs while maintaining competitive performance in DFER.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel supervised classification with self-supervised reconstruction
Adaptive temporal soft mask for dynamic expression recognition
Class-agnostic and class-semantic masks reduce semantic redundancy
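
One common way to train a classification branch and a reconstruction branch in parallel is to sum their losses. The sketch below assumes a cross-entropy term plus a weighted mean-squared reconstruction error over the hidden tokens; the source does not state AdaTosk's actual objective, so `joint_loss`, `recon_weight`, and the helper functions are hypothetical.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy for one sample, using a numerically stable log-softmax."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

def masked_mse(pred, target, masked):
    """Mean squared error over hidden tokens only (masked: True = hidden).

    Assumes at least one token is hidden; otherwise the mean is undefined.
    """
    diff = (pred - target)[masked]
    return float((diff ** 2).mean())

def joint_loss(logits, label, pred, target, masked, recon_weight=1.0):
    """Assumed joint objective: classification loss + weighted reconstruction loss."""
    return cross_entropy(logits, label) + recon_weight * masked_mse(pred, target, masked)

# Toy example: 7 expression classes and 4 tokens of dimension 2 (hypothetical sizes).
logits = np.zeros(7)
target = np.zeros((4, 2))
pred = np.ones((4, 2))
masked = np.array([True, False, True, False])
loss = joint_loss(logits, 3, pred, target, masked, recon_weight=0.5)
```

Setting `recon_weight` to zero reduces this to plain supervised training, so the weight controls how strongly the self-supervised branch regularizes the shared features.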
Mengzhu Li
Beijing Key Laboratory of Information Service Engineering, Beijing Union University
Quanxing Zha
Huaqiao University
Hongjun Wu
Nanyang Technological University