ViT-FREE: Efficient Face Recognition via Early Exiting and Synthetic Adaptation

📅 2026-06-10
🤖 AI Summary
This work addresses the high computational cost of Vision Transformers (ViTs) in face recognition, which hinders deployment on resource-constrained devices. The authors propose ViT-FREE, a multi-exit framework that applies training-free early exiting to ViT-based face verification: embeddings are taken directly from intermediate encoder blocks, so inference becomes cheaper without modifying or retraining the backbone. An analysis of patch embeddings and attention maps across depth motivates the approach, showing that intermediate layers increasingly align with the final representation. A complementary variant, ViT-FREE_FT, fine-tunes only the projection layers on a small synthetic dataset to make shallow exits more discriminative. Experiments show that exiting at layer 10 yields up to a 20% speedup with only a 1.5% drop in verification performance on benchmarks such as IJB-C, and that ViT-FREE_FT improves shallow exits while leaving deeper exits largely unaffected.
📝 Abstract
Vision Transformers (ViTs) have gained significant attention in computer vision and shown strong potential for face recognition (FR). However, their high computational cost makes deployment on resource-constrained devices challenging, motivating methods that balance efficiency and accuracy. In this work, we investigate early exiting in pretrained ViTs as a simple yet effective training-free strategy for efficient FR inference. Leveraging the uniform feature dimensionality across transformer encoder blocks, we introduce ViT-FREE, a multi-exit framework that enables face verification directly from intermediate representations without modifying or retraining the backbone model, thereby reducing inference cost. Empirically, we show that patch embeddings and attention maps evolve progressively across depth, exhibiting high similarity between consecutive ViT blocks and increasing alignment with the final representation. This indicates gradual feature refinement and attention convergence, suggesting that intermediate layers already provide stable and discriminative representations suitable for early exiting. Through extensive experiments on multiple FR benchmarks, we systematically analyze the accuracy-efficiency trade-off across exit depths. Our results demonstrate that later exits achieve a highly favorable balance, with exiting at layer 10 yielding up to a 20% speedup while incurring only a 1.5% drop in verification performance on benchmarks such as IJB-C. We also propose ViT-FREE_FT, a lightweight exit-specific fine-tuning strategy that adapts only the projection layers using a small synthetic dataset while keeping the transformer backbone frozen. This approach improves the performance of shallow exits while preserving the efficiency benefits and leaving deeper exits largely unaffected.
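The early-exit idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy residual blocks, the helper names (make_toy_blocks, embed_at_exit, verify, relative_block_cost), the cosine threshold of 0.5, and the depth of 12 are all assumptions chosen for illustration. The sketch only shows why uniform dimensionality across blocks makes it possible to stop after any block and compare the resulting embeddings directly.

```python
import math
import random


def make_toy_blocks(depth, dim, seed=0):
    """Build `depth` toy blocks that all map dim -> dim.

    Hypothetical stand-ins for pretrained encoder blocks; not the paper's model.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
        for _ in range(depth)
    ]


def apply_block(weights, x):
    # Residual update x + tanh(W x); the output keeps the input dimensionality.
    return [
        xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
        for xi, row in zip(x, weights)
    ]


def embed_at_exit(x, blocks, exit_layer):
    """Run only the first `exit_layer` blocks and return an L2-normalised embedding."""
    if not 1 <= exit_layer <= len(blocks):
        raise ValueError("exit_layer must be in [1, depth]")
    for weights in blocks[:exit_layer]:
        x = apply_block(weights, x)
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero embedding")
    return [v / norm for v in x]


def verify(x1, x2, blocks, exit_layer, threshold=0.5):
    """Declare a match when the exit-level embeddings are similar enough.

    The threshold is an assumed value; in practice it would be tuned per exit.
    """
    e1 = embed_at_exit(x1, blocks, exit_layer)
    e2 = embed_at_exit(x2, blocks, exit_layer)
    # Both embeddings are unit vectors, so the dot product is the cosine similarity.
    return sum(u * v for u, v in zip(e1, e2)) >= threshold


def relative_block_cost(exit_layer, depth):
    # Fraction of encoder blocks executed; ignores patch embedding and any head.
    return exit_layer / depth


if __name__ == "__main__":
    blocks = make_toy_blocks(depth=12, dim=8)  # depth 12 is an assumption
    rng = random.Random(1)
    face = [rng.gauss(0.0, 1.0) for _ in range(8)]
    print(verify(face, face, blocks, exit_layer=10))  # identical inputs match
    print(relative_block_cost(10, 12))
```

Under the assumed depth of 12, exiting at block 10 executes 10 of 12 blocks. The abstract reports up to a 20% speedup for that exit, but the depth of the backbone is not stated here, so the two figures should not be read as a derivation of one from the other.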
Problem

Research questions and friction points this paper is trying to address.

face recognition
Vision Transformers
efficient inference
early exiting
accuracy-efficiency trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

early exiting
Vision Transformers
face recognition
training-free
synthetic adaptation