🤖 AI Summary
This work addresses the challenge of achieving high-accuracy yet low-overhead face recognition on mobile and edge devices by proposing FaceLiVTv2, a novel architecture that synergistically combines the strengths of CNNs and Transformers. Its key innovations include Lite Multi-Head Linear Attention (Lite MHLA) to reduce computational redundancy while preserving representational diversity, the RepMix module for unified modeling of local–global feature interactions, and global depthwise separable convolutions to enhance spatial aggregation during embedding. Extensive experiments demonstrate that FaceLiVTv2 significantly outperforms existing lightweight methods across multiple benchmarks, including LFW, CFP-FP, AgeDB-30, and IJB: it achieves 22% lower inference latency than FaceLiVTv1, runs up to 30.8% faster than GhostFaceNets, and reduces latency by 20–41% compared to EdgeFace and KANFace while maintaining higher accuracy.
📝 Abstract
Lightweight face recognition is increasingly important for deployment on edge and mobile devices, where strict constraints on latency, memory, and energy consumption must be met alongside reliable accuracy. Although recent hybrid CNN–Transformer architectures have advanced global context modeling, striking an effective balance between recognition performance and computational efficiency remains an open challenge. In this work, we present FaceLiVTv2, an improved version of our FaceLiVT hybrid architecture designed for efficient global–local feature interaction in mobile face recognition. At its core is Lite Multi-Head Linear Attention (Lite MHLA), a lightweight global token interaction module that replaces the original multi-layer attention design with multi-head linear token projections and affine rescale transformations, reducing redundancy while preserving representational diversity across heads. We further integrate Lite MHLA into a unified RepMix block that coordinates local and global feature interactions and adopts global depthwise convolution for adaptive spatial aggregation in the embedding stage. Under our experimental setup, results on LFW, CA-LFW, CP-LFW, CFP-FP, AgeDB-30, and IJB show that FaceLiVTv2 consistently improves the accuracy–efficiency trade-off over existing lightweight methods. Notably, FaceLiVTv2 reduces mobile inference latency by 22% relative to FaceLiVTv1, achieves speedups of up to 30.8% over GhostFaceNets on mobile devices, and delivers 20–41% latency improvements over EdgeFace and KANFace across platforms while maintaining higher recognition accuracy. These results demonstrate that FaceLiVTv2 offers a practical and deployable solution for real-time face recognition. Code is available at https://github.com/novendrastywn/FaceLiVT.
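To illustrate the general idea behind multi-head *linear* attention, the sketch below implements a generic kernelized linear-attention layer in NumPy. This is an assumption-laden approximation, not the paper's Lite MHLA: the function names, shapes, random projection weights, and the ELU+1 feature map (from standard linear-attention formulations) are illustrative choices, and the paper's affine rescale transformation is not reproduced. The point is the complexity: by computing `phi(K)^T V` first, attention costs O(N·d²) per head instead of the O(N²·d) of softmax attention.

```python
import numpy as np

def linear_attention(q, k, v):
    """Kernelized linear attention: phi(q) @ (phi(k)^T v), phi = elu + 1.

    Summarizing keys/values into a (d, d) matrix first avoids the
    O(N^2) pairwise score matrix of softmax attention.
    """
    phi = lambda t: np.where(t > 0, t + 1.0, np.exp(t))  # elu(t) + 1 > 0
    q, k = phi(q), phi(k)
    kv = k.T @ v                      # (d, d) summary, cost O(N * d^2)
    z = q @ k.sum(axis=0)             # per-query normalizer, shape (N,)
    return (q @ kv) / (z[:, None] + 1e-6)

def multi_head_linear_attention(x, num_heads=4, seed=0):
    """Hypothetical multi-head wrapper (names/shapes are assumptions).

    Each head gets its own slice of linearly projected tokens; in a
    trained model Wq/Wk/Wv would be learned, here they are random.
    """
    rng = np.random.default_rng(seed)
    N, C = x.shape
    d = C // num_heads
    Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.02 for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for h in range(num_heads):
        s = slice(h * d, (h + 1) * d)
        heads.append(linear_attention(q[:, s], k[:, s], v[:, s]))
    return np.concatenate(heads, axis=1)  # (N, C), same shape as input
```

Under this formulation the cost grows linearly in the number of tokens N, which is what makes linear-attention variants attractive for mobile-latency budgets.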