🤖 AI Summary
To address the low development efficiency and high training cost of multimodal models, this paper proposes a lightweight vision-language alignment paradigm that freezes unimodal encoders. Methodologically, it selects frozen pretrained encoders (e.g., DINOv2, All-RoBERTa-Large) based on semantic similarity and employs lightweight MLP projectors for cross-modal mapping; it further constructs a high-quality, concept-rich in-house image-text dataset to bypass end-to-end multimodal training. The core contribution is the first systematic empirical validation that frozen unimodal encoders can be directly aligned, coupled with a semantic-guided encoder selection strategy and low-overhead adaptation protocol. Experiments demonstrate state-of-the-art performance: 76% zero-shot accuracy on ImageNet, and superior results across 12 zero-shot classification and 2 cross-modal retrieval benchmarks. Moreover, the approach reduces data requirements by 20× and computational cost by 65× compared to conventional multimodal training.
📝 Abstract
Recent contrastive multimodal vision-language models like CLIP have demonstrated robust open-world semantic understanding, becoming the standard image backbones for vision-language applications. However, recent findings suggest high semantic similarity between well-trained unimodal encoders, which raises a key question: Is there a plausible way to connect unimodal backbones for vision-language tasks? To this end, we propose a novel framework that aligns vision and language using frozen unimodal encoders. It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple MLP projectors. We evaluated our approach on 12 zero-shot classification datasets and 2 image-text retrieval datasets. Our best model, utilizing the DINOv2 vision encoder and the All-Roberta-Large text encoder, achieves 76% accuracy on ImageNet with a 20-fold reduction in data and a 65-fold reduction in compute requirements compared to multimodal alignment where models are trained from scratch. The proposed framework enhances the accessibility of multimodal model development while enabling flexible adaptation across diverse scenarios. Code and curated datasets are available at github.com/mayug/freeze-align.
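The core mechanism described above, keeping both encoders frozen and training only lightweight MLP projectors that map each modality into a shared space scored with a CLIP-style similarity matrix, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the feature dimensions, projector sizes, and temperature are assumed placeholders, and the frozen DINOv2 / All-Roberta-Large outputs are stood in for by random vectors.

```python
import numpy as np

def mlp_projector(x, w1, b1, w2, b2):
    """Two-layer MLP mapping a frozen encoder's embedding into the shared space."""
    h = np.maximum(x @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2

def clip_logits(img_z, txt_z, temperature=0.07):
    """Cosine-similarity logits used by a CLIP-style contrastive objective."""
    img_z = img_z / np.linalg.norm(img_z, axis=1, keepdims=True)
    txt_z = txt_z / np.linalg.norm(txt_z, axis=1, keepdims=True)
    return img_z @ txt_z.T / temperature

rng = np.random.default_rng(0)
# Toy stand-ins for frozen encoder outputs (dimensions are illustrative).
img_feat = rng.normal(size=(4, 768))   # "DINOv2" image embeddings, frozen
txt_feat = rng.normal(size=(4, 1024))  # "All-Roberta-Large" text embeddings, frozen

# Projector weights: in this paradigm, these MLPs are the ONLY trained parameters.
wi1, bi1 = 0.02 * rng.normal(size=(768, 512)), np.zeros(512)
wi2, bi2 = 0.02 * rng.normal(size=(512, 256)), np.zeros(256)
wt1, bt1 = 0.02 * rng.normal(size=(1024, 512)), np.zeros(512)
wt2, bt2 = 0.02 * rng.normal(size=(512, 256)), np.zeros(256)

logits = clip_logits(mlp_projector(img_feat, wi1, bi1, wi2, bi2),
                     mlp_projector(txt_feat, wt1, bt1, wt2, bt2))
print(logits.shape)  # 4x4 image-text similarity matrix for a batch of 4 pairs
```

In training, the diagonal of this matrix would be pushed up and the off-diagonal entries pushed down via a symmetric cross-entropy (InfoNCE) loss; at zero-shot inference, class names are encoded as captions and each image is assigned the class with the highest similarity.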