🤖 AI Summary
To address the poor robustness of facial expression recognition in in-the-wild videos caused by domain shift, this paper proposes a lightweight and efficient test-time adaptation (TTA) paradigm. Methodologically, it introduces a novel Fisher information-based dynamic parameter selection mechanism that updates only ~22K parameters, over 20× fewer than existing TTA methods, and incorporates temporal consistency regularization to model inter-frame dependencies. Empirical analysis shows that reliable importance estimates can be obtained from as few as one to three frames. On the AffWild2 benchmark, the approach improves F1 score by 7.7% over the base model, outperforming state-of-the-art TTA methods while significantly reducing computational overhead and thereby enabling real-time deployment.
📝 Abstract
Robust facial expression recognition in unconstrained, "in-the-wild" environments remains challenging due to significant domain shifts between training and testing distributions. Test-time adaptation (TTA) offers a promising solution by adapting pre-trained models during inference without requiring labeled test data. However, existing TTA approaches typically rely on manually selecting which parameters to update, potentially leading to suboptimal adaptation and high computational costs. This paper introduces a novel Fisher-driven selective adaptation framework that dynamically identifies and updates only the most critical model parameters based on their importance as quantified by Fisher information. By integrating this principled parameter selection approach with temporal consistency constraints, our method enables efficient and effective adaptation specifically tailored for video-based facial expression recognition. Experiments on the challenging AffWild2 benchmark demonstrate that our approach significantly outperforms existing TTA methods, achieving a 7.7% improvement in F1 score over the base model while adapting only 22,000 parameters, more than 20 times fewer than comparable methods. Our ablation studies further reveal that parameter importance can be effectively estimated from minimal data, with sampling just 1-3 frames sufficient for substantial performance gains. The proposed approach not only enhances recognition accuracy but also dramatically reduces computational overhead, making test-time adaptation more practical for real-world affective computing applications.
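The two ingredients described above, Fisher-based parameter selection and temporal consistency, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names and toy gradient values are hypothetical, and the diagonal Fisher information for each parameter is approximated, as is standard, by the average squared gradient over a few sampled frames.

```python
# Hedged sketch of Fisher-driven selective adaptation.
# All names and values are illustrative, not the paper's API.

def fisher_importance(grad_samples):
    """Diagonal Fisher estimate per parameter: mean squared gradient
    over the sampled frames (grad_samples: list of gradient vectors)."""
    n_frames = len(grad_samples)
    n_params = len(grad_samples[0])
    return [
        sum(g[i] ** 2 for g in grad_samples) / n_frames
        for i in range(n_params)
    ]

def select_top_k(importance, k):
    """Indices of the k most important parameters; only these
    would be updated at test time, keeping the rest frozen."""
    return sorted(range(len(importance)), key=lambda i: -importance[i])[:k]

def temporal_consistency_loss(probs):
    """Penalize disagreement between class-probability vectors of
    consecutive frames (mean squared difference)."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in zip(probs, probs[1:])
    ) / max(len(probs) - 1, 1)

# Toy example: gradients from 3 sampled frames of a 5-parameter model,
# echoing the finding that 1-3 frames suffice for importance estimation.
grads = [
    [0.1, 0.9, 0.0, 0.3, 0.05],
    [0.2, 1.1, 0.1, 0.2, 0.00],
    [0.1, 0.8, 0.0, 0.4, 0.10],
]
imp = fisher_importance(grads)
mask = select_top_k(imp, k=2)  # adapt only the top-k parameters
```

In a real model the gradient vectors would come from backpropagating an unsupervised test-time objective (e.g. entropy plus the temporal consistency term) through the network, and the selection would be applied per tensor rather than per scalar; the sketch only conveys the selection principle.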