🤖 AI Summary
To address the challenge of deploying large vision-language models such as CLIP on resource-constrained edge devices (e.g., GPU-less retrofit automotive cameras), this paper proposes the first cross-architecture knowledge distillation framework tailored for edge deployment, transferring CLIP’s zero-shot capabilities to a lightweight model. The student model pairs an EfficientNet-B3 backbone with an MLP projection head and is trained by jointly optimizing a contrastive learning objective and a cross-modal feature alignment loss, which compresses the joint text-image embedding space. Evaluated on an ARM Cortex-A72 processor with 2 GB of RAM, the distilled model runs in real time at 23 FPS while using under 180 MB of memory. It retains more than 92% of the original CLIP’s zero-shot classification accuracy, marking the first demonstration of practical zero-shot image annotation on low-cost automotive edge hardware.
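The joint objective described above can be illustrated with a minimal sketch in plain Python. This is not the paper's implementation. The cosine-distance form of the alignment term, the image-to-text InfoNCE form of the contrastive term, the weighting factor `alpha`, the `temperature` value, and all function names are assumptions chosen for illustration; embeddings are plain lists of floats standing in for student and teacher outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def alignment_loss(student_img, teacher_img):
    """Mean cosine distance between student and teacher image embeddings."""
    total = 0.0
    for s, t in zip(student_img, teacher_img):
        total += 1.0 - dot(l2_normalize(s), l2_normalize(t))
    return total / len(student_img)


def contrastive_loss(student_img, teacher_txt, temperature=0.07):
    """InfoNCE over a batch (image-to-text direction): image i should match text i."""
    imgs = [l2_normalize(v) for v in student_img]
    txts = [l2_normalize(v) for v in teacher_txt]
    total = 0.0
    for i, img in enumerate(imgs):
        logits = [dot(img, t) / temperature for t in txts]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum - logits[i]  # cross-entropy with target index i
    return total / len(imgs)


def distillation_loss(student_img, teacher_img, teacher_txt,
                      alpha=0.5, temperature=0.07):
    """Weighted sum of the two terms; alpha is a hypothetical mixing weight."""
    return (alpha * contrastive_loss(student_img, teacher_txt, temperature)
            + (1.0 - alpha) * alignment_loss(student_img, teacher_img))
```

In this sketch the alignment term pulls each student image embedding toward the teacher's embedding of the same image, while the contrastive term keeps it closest to its matching text embedding within the batch. A real training loop would compute both terms on tensors with automatic differentiation.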
📝 Abstract
Foundation models like CLIP (Contrastive Language-Image Pretraining) have revolutionized vision-language tasks by enabling zero-shot and few-shot learning through cross-modal alignment. However, their computational complexity and large memory footprint make them unsuitable for deployment on resource-constrained edge devices, such as in-car cameras used for image collection and real-time processing. To address this challenge, we propose Clip4Retrofit, an efficient model distillation framework that enables real-time image labeling on edge devices. The framework is deployed on the Retrofit camera, a cost-effective edge device retrofitted into thousands of vehicles, despite strict limitations on compute performance and memory. Our approach distills the knowledge of the CLIP model into a lightweight student model, combining EfficientNet-B3 with multi-layer perceptron (MLP) projection heads to preserve cross-modal alignment while significantly reducing computational requirements. We demonstrate that our distilled model achieves a balance between efficiency and performance, making it ideal for deployment in real-world scenarios. Experimental results show that Clip4Retrofit can perform real-time image labeling and object identification on edge devices with limited resources, offering a practical solution for applications such as autonomous driving and retrofitting existing systems. This work bridges the gap between state-of-the-art vision-language models and their deployment in resource-constrained environments, paving the way for broader adoption of foundation models in edge computing.
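As a rough illustration of how zero-shot image labeling works with CLIP-style embeddings, the sketch below picks the label whose text embedding has the highest cosine similarity to an image embedding. The function names, label set, and three-dimensional toy vectors are hypothetical. In practice the image embedding would come from the distilled student and the label embeddings from a text encoder, which could be precomputed offline.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def zero_shot_label(image_embedding, label_embeddings):
    """Return the (label, score) pair with the highest cosine similarity."""
    scored = ((label, cosine(image_embedding, emb))
              for label, emb in label_embeddings.items())
    return max(scored, key=lambda pair: pair[1])


# Toy label embeddings, standing in for text-encoder outputs of prompts
# such as "a photo of a car" (hypothetical values for illustration only).
toy_label_embeddings = {
    "car": [1.0, 0.0, 0.0],
    "pedestrian": [0.0, 1.0, 0.0],
    "traffic sign": [0.0, 0.0, 1.0],
}

label, score = zero_shot_label([0.9, 0.1, 0.0], toy_label_embeddings)
print(label, round(score, 3))
```

Because the label embeddings depend only on the text prompts, this lookup adds little work per frame on top of the image encoder's forward pass.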