🤖 AI Summary
Facial age estimation faces dual challenges in modeling both local details and global semantics. To address this, we propose a hybrid architecture integrating ConvNeXt and Vision Transformer (ViT), featuring a novel lightweight cross-modal attention module that embeds ViT’s global contextual modeling capability into a CNN backbone—enhancing age-sensitive feature representation while preserving computational efficiency. We employ pretrained model initialization, linear fusion of multi-scale features, and strong regularization during training. Our method achieves state-of-the-art performance on MORPH II, CACD, and AFAD, with mean absolute errors of 2.31, 3.47, and 3.89 years, respectively—substantially outperforming pure CNN or pure ViT baselines. Ablation studies validate the critical role of our attention mechanism and fusion paradigm. This work establishes a reproducible, generalizable framework for fusing visual backbones, advancing the design of efficient and semantically rich architectures for age estimation.
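The summary names a lightweight cross-modal attention module that injects ViT's global context into the CNN backbone, but gives no equations. The sketch below is a hypothetical single-head version of that idea, assuming CNN feature-map positions act as queries over ViT patch tokens, followed by residual fusion; all shapes, names (`cross_modal_attention`, `d_k`), and the random projection weights are invented for illustration, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(cnn_feats, vit_tokens, d_k=64, seed=0):
    """Hypothetical sketch: CNN positions (queries) attend over ViT tokens.

    cnn_feats:  (N, d_cnn)  flattened ConvNeXt feature-map positions
    vit_tokens: (M, d_vit)  ViT patch embeddings
    Returns fused features of shape (N, d_cnn).
    """
    rng = np.random.default_rng(seed)
    # Stand-in projection weights; in a trained model these are learned.
    W_q = rng.standard_normal((cnn_feats.shape[1], d_k)) / np.sqrt(cnn_feats.shape[1])
    W_k = rng.standard_normal((vit_tokens.shape[1], d_k)) / np.sqrt(vit_tokens.shape[1])
    W_v = rng.standard_normal((vit_tokens.shape[1], cnn_feats.shape[1])) / np.sqrt(vit_tokens.shape[1])

    Q, K, V = cnn_feats @ W_q, vit_tokens @ W_k, vit_tokens @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (N, M) cross-modal attention weights
    context = attn @ V                      # global ViT context gathered per CNN position
    return cnn_feats + context              # residual fusion keeps the CNN pathway intact

# Toy shapes: 49 ConvNeXt positions (7x7 map), 196 ViT tokens (14x14 patches)
fused = cross_modal_attention(np.ones((49, 768)), np.ones((196, 768)))
```

The residual form means the module only adds context on top of the CNN features, which is one plausible way to keep such a fusion lightweight.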
📝 Abstract
Age estimation from facial images is a complex and multifaceted challenge in computer vision. In this study, we present a novel hybrid architecture that combines ConvNeXt, a state-of-the-art evolution of convolutional neural networks (CNNs), with the Vision Transformer (ViT). While each model independently delivers excellent performance across a variety of tasks, their integration leverages their complementary strengths: the CNN's localized feature extraction and the Transformer's global attention mechanism. Our proposed ConvNeXt-ViT hybrid was thoroughly evaluated on benchmark age estimation datasets, including MORPH II, CACD, and AFAD, and achieved superior performance in terms of mean absolute error (MAE). To address computational constraints, we leverage pre-trained models and systematically explore different configurations, using linear layers and advanced regularization techniques to optimize the architecture. Comprehensive ablation studies highlight the critical role of individual components and training strategies, and in particular emphasize the importance of adapted attention mechanisms within the CNN framework for improving the model's focus on age-relevant facial features. The results show that the ConvNeXt-ViT hybrid not only outperforms traditional methods but also provides a robust foundation for future advances in age estimation and related visual tasks. This work underscores the transformative potential of hybrid architectures and represents a promising direction for the seamless integration of CNNs and Transformers to address complex computer vision challenges.
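Both the summary and the abstract report results as mean absolute error in years. For readers unfamiliar with the metric, MAE is simply the average absolute gap between predicted and ground-truth ages; a minimal sketch (with made-up toy ages, not values from the paper):

```python
import numpy as np

def mean_absolute_error(pred_ages, true_ages):
    """MAE in years: mean absolute deviation between predicted and true ages."""
    pred = np.asarray(pred_ages, dtype=float)
    true = np.asarray(true_ages, dtype=float)
    return float(np.mean(np.abs(pred - true)))

# Toy example: errors of 2, 2, and 0 years average to 4/3.
print(mean_absolute_error([25.0, 40.0, 33.0], [23.0, 42.0, 33.0]))  # ≈ 1.333
```

Under this metric, the reported 2.31 years on MORPH II means predictions deviate from true ages by about 2.3 years on average.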