LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

📅 2024-11-07
🏛️ arXiv.org
📈 Citations: 7
Influential: 2
📄 PDF
🤖 AI Summary
To address CLIP’s limitations in comprehending long, complex image captions and achieving fine-grained cross-modal alignment, this paper proposes a lightweight, LLM-augmented framework. Methodologically, it introduces (1) a caption-to-caption contrastive fine-tuning paradigm that circumvents the adaptation challenges arising from LLMs’ autoregressive nature; (2) enhancement of only the text branch via LLM integration and parameter-efficient tuning, with the CLIP visual encoder kept frozen; and (3) preservation of the original visual backbone, eliminating architectural modifications and substantially reducing training overhead. Evaluated on zero-shot cross-modal retrieval, cross-lingual retrieval, and multimodal large language model pretraining, the method consistently outperforms CLIP, EVA02, and SigLIP2. Notably, it achieves nearly 4× faster training than LoRA-based methods while maintaining superior accuracy.
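The caption-to-caption contrastive fine-tuning named above can be pictured as a symmetric InfoNCE loss over paired caption embeddings, where two captions describing the same image are treated as positives. The function below is a minimal NumPy sketch of that idea; the pooling, temperature, and exact loss formulation are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def caption_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE over two batches of caption embeddings.

    emb_a, emb_b: (batch, dim) pooled embeddings for two captions of the
    same image; row i of emb_a is the positive pair of row i of emb_b.
    """
    # Cosine similarity matrix between the two caption batches.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature

    n = logits.shape[0]

    def cross_entropy_diag(l):
        # Numerically stable log-softmax; the diagonal entries are the
        # matching caption pairs (the positives).
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Symmetrize: captions from set A retrieve set B, and vice versa.
    return (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)) / 2
```

Minimizing this loss pushes embeddings of captions for the same image together and embeddings of unrelated captions apart, which is what gives the LLM's output space the discriminative structure that plain autoregressive training lacks.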

📝 Abstract
CLIP is a foundational multimodal model that aligns image and text features into a shared representation space via contrastive learning on large-scale image-text pairs. Its effectiveness primarily stems from the use of natural language as rich supervision. Motivated by the remarkable advancements in large language models (LLMs), this work explores how LLMs' superior text understanding and extensive open-world knowledge can enhance CLIP's capability, especially for processing longer and more complex image captions. We propose an efficient post-training strategy that integrates LLMs into pretrained CLIP. To address the challenge posed by the autoregressive nature of LLMs, we introduce a caption-to-caption contrastive fine-tuning framework, significantly enhancing the discriminative quality of LLM outputs. Extensive experiments demonstrate that our approach outperforms LoRA-based methods, achieving nearly fourfold faster training with superior performance. Furthermore, we validate substantial improvements over state-of-the-art models such as CLIP, EVA02, and SigLIP2 across various zero-shot multimodal retrieval tasks, cross-lingual retrieval tasks, and multimodal language model pretraining.
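To make the post-training setup concrete, the toy sketch below keeps image features fixed (standing in for the frozen CLIP visual encoder) and routes LLM caption features through a small linear adapter into CLIP's embedding space, then scores zero-shot retrieval by cosine similarity. The adapter, feature dimensions, and random features are hypothetical stand-ins for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM_LLM, DIM_CLIP = 16, 8  # toy sizes; real models are far larger

# Stand-ins for the frozen encoders: image features from the pretrained
# CLIP visual backbone, caption features from the contrastively tuned LLM.
image_features = rng.normal(size=(4, DIM_CLIP))
caption_features = rng.normal(size=(4, DIM_LLM))

# The only trainable piece in this sketch: a linear adapter projecting
# LLM caption features into CLIP's embedding space.
adapter = rng.normal(size=(DIM_LLM, DIM_CLIP)) * 0.1

def encode_text(feats, W):
    z = feats @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def encode_image(feats):
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

# Zero-shot retrieval: rank images by cosine similarity to each caption.
sims = encode_text(caption_features, adapter) @ encode_image(image_features).T
best_image_per_caption = sims.argmax(axis=1)
```

Because only the adapter (and, per the abstract, parameter-efficient text-side tuning) would receive gradients, the visual backbone and its deployment pipeline stay untouched, which is where the training-cost savings come from.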
Problem

Research questions and friction points this paper is trying to address.

Enhance CLIP's capability using LLMs for complex captions
Integrate LLMs into CLIP efficiently via post-training
Improve multimodal retrieval tasks with LLM-enhanced CLIP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates LLMs into pretrained CLIP efficiently
Uses caption-to-caption contrastive fine-tuning framework
Outperforms LoRA-based methods with faster training
Weiquan Huang
Tongji University
Aoqi Wu
Tongji University
Yifan Yang
Microsoft Corporation
Xufang Luo
Microsoft Corporation
Yuqing Yang
Microsoft Corporation
Liang Hu
Tongji University
Qi Dai
Microsoft Corporation
Xiyang Dai
Microsoft
Computer Vision, Deep Learning
Dongdong Chen
Microsoft Corporation
Chong Luo
Microsoft Research
multimedia communications, computer vision
Lili Qiu
NAI Fellow, ACM Fellow, IEEE Fellow, Professor, Dept. of Computer Science, The University of Texas
Wireless Networks, Wireless Sensing, Mobile Computing, Systems, 5G