🤖 AI Summary
This survey addresses the absence of a holistic, systematically organized review of deep learning-based facial expression recognition (FER), a field previously surveyed only along narrow task-, architecture-, or application-specific axes. It organizes FER's development into five phases and proposes a multi-criteria taxonomy with seven axes (recognition task, input modality, face pre-processing pipeline, network architecture, learning strategy, acquisition setting, and application domain), explicitly linking FER to the wider Facial Affect Recognition (FAR) domain. The review covers deep convolutional, attention-based, vision-language, and foundation-model approaches, and compares public datasets, evaluation protocols, and the benchmark performance of representative state-of-the-art methods. It also analyzes the strengths and limitations of each category under in-the-wild conditions and discusses current challenges and future research directions.
📝 Abstract
Facial Expression Recognition (FER) has advanced rapidly over the last decade, driven by the shift from handcrafted descriptors and shallow classifiers to deep convolutional, attention-based, vision-language, and foundation-model architectures, and by the parallel growth of large-scale in-the-wild benchmarks spanning categorical, dimensional, compound, micro-expression, Action Unit (AU), and intensity-estimation tasks. Yet the deep learning-based FER landscape has so far been reviewed only along narrow task-, architecture-, or application-specific axes, leaving a holistic, systematically organized account of its recent advances missing. This survey addresses that gap with a comprehensive review of recent deep learning-based FER, explicitly linked to the wider Facial Affect Recognition (FAR) domain. Its main contributions are: (a) a division of FER's evolution into five distinct phases, from handcrafted features and classical machine learning to attention-based, vision-language, and foundation-model approaches, with the key milestone works of each; (b) a multi-criteria taxonomy analyzing the literature along seven complementary axes: recognition task, input modality, face pre-processing pipeline, network architecture, learning strategy, acquisition setting, and application domain; (c) a per-criterion comparative analysis, with critical insights into the strengths and limitations of each category under in-the-wild conditions; (d) a task-organized review of public FER datasets, with their annotation schemes, modalities, and evaluation protocols; (e) a compilation of performance metrics and a per-task quantitative comparison of representative state-of-the-art methods on widely adopted benchmarks; and (f) a discussion of current challenges and promising future directions.