Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation

📅 2025-05-23

🤖 AI Summary

Deploying large vision-language models such as CLIP on resource-constrained edge devices (e.g., GPU-less retrofit automotive cameras) is challenging. This paper proposes the first cross-architecture knowledge distillation framework tailored for edge deployment, transferring CLIP’s zero-shot capabilities to lightweight models. The student uses an EfficientNet-B3 backbone coupled with a multi-layer MLP projection head and is trained by jointly optimizing a contrastive learning objective and a cross-modal feature alignment loss to compress the joint text-image embeddings. Evaluated on an ARM Cortex-A72 processor with 2 GB RAM, the distilled model runs in real time at 23 FPS while consuming under 180 MB of memory. Its zero-shot classification accuracy exceeds 92% of the original CLIP’s, marking the first demonstration of practical zero-shot image annotation on low-cost automotive edge hardware.
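The joint objective described in the summary can be illustrated with a small sketch. This is not the paper's implementation: it assumes the contrastive term is a one-directional InfoNCE loss between student image embeddings and teacher text embeddings, that the alignment term is a mean cosine distance to the teacher's image embeddings, and that the two are mixed with a hypothetical weight `alpha`. The function names, `alpha`, and the `temperature` default are all illustrative choices.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def alignment_loss(student_img, teacher_img):
    """Mean (1 - cosine similarity) between paired student and teacher image embeddings."""
    total = 0.0
    for s, t in zip(student_img, teacher_img):
        total += 1.0 - dot(l2_normalize(s), l2_normalize(t))
    return total / len(student_img)


def contrastive_loss(student_img, teacher_txt, temperature=0.07):
    """Image-to-text InfoNCE: the positive for image i is text i in the batch."""
    imgs = [l2_normalize(v) for v in student_img]
    txts = [l2_normalize(v) for v in teacher_txt]
    total = 0.0
    for i, img in enumerate(imgs):
        logits = [dot(img, txt) / temperature for txt in txts]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(imgs)


def distillation_loss(student_img, teacher_img, teacher_txt, alpha=0.5, temperature=0.07):
    """Weighted sum of the two terms; alpha is a hypothetical mixing weight."""
    return (alpha * contrastive_loss(student_img, teacher_txt, temperature)
            + (1.0 - alpha) * alignment_loss(student_img, teacher_img))
```

In a real training loop these embeddings would come from the student's projection head and the frozen CLIP encoders, and the loss would be computed with an autodiff framework rather than plain Python lists.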

📝 Abstract

Foundation models like CLIP (Contrastive Language-Image Pretraining) have revolutionized vision-language tasks by enabling zero-shot and few-shot learning through cross-modal alignment. However, their computational complexity and large memory footprint make them unsuitable for deployment on resource-constrained edge devices, such as in-car cameras used for image collection and real-time processing. To address this challenge, we propose Clip4Retrofit, an efficient model distillation framework that enables real-time image labeling on edge devices. The framework is deployed on the Retrofit camera, a cost-effective edge device that has been retrofitted into thousands of vehicles and that imposes strict limitations on compute performance and memory. Our approach distills the knowledge of the CLIP model into a lightweight student model, combining EfficientNet-B3 with multi-layer perceptron (MLP) projection heads to preserve cross-modal alignment while significantly reducing computational requirements. We demonstrate that our distilled model achieves a balance between efficiency and performance, making it well suited for deployment in real-world scenarios. Experimental results show that Clip4Retrofit can perform real-time image labeling and object identification on edge devices with limited resources, offering a practical solution for applications such as autonomous driving and retrofitting existing systems. This work bridges the gap between state-of-the-art vision-language models and their deployment in resource-constrained environments, paving the way for broader adoption of foundation models in edge computing.
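For readers unfamiliar with how CLIP-style zero-shot labeling works, a minimal sketch follows. It assumes the text embeddings for each candidate label have been computed ahead of time, so the device only needs to embed the image and compare. The function names, labels, and three-dimensional toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def label_image(image_embedding, label_embeddings):
    """Return the (label, score) pair whose text embedding is closest to the image."""
    scored = ((label, cosine(image_embedding, emb))
              for label, emb in label_embeddings.items())
    return max(scored, key=lambda pair: pair[1])


# Toy example: precomputed text embeddings for three hypothetical labels.
label_embeddings = {
    "traffic light": [1.0, 0.0, 0.0],
    "pedestrian": [0.0, 1.0, 0.0],
    "truck": [0.0, 0.0, 1.0],
}
best_label, best_score = label_image([0.9, 0.1, 0.0], label_embeddings)
```

Because the label set lives entirely in the text embeddings, new labels can be added by embedding new prompts offline, without retraining the image model.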

Problem

Research questions and friction points this paper is trying to address.

Reducing CLIP model complexity for edge devices

Enabling real-time image labeling on resource-limited hardware

Distilling CLIP into a lightweight EfficientNet-B3 student with MLP projection heads

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-architecture CLIP distillation for edge devices

Lightweight EfficientNet-B3 with MLP projection heads

Real-time image labeling on resource-constrained hardware
