Language-Unlocked ViT (LUViT): Empowering Self-Supervised Vision Transformers with LLMs

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Integrating Vision Transformers (ViTs) with large language models (LLMs) is hampered by modality heterogeneity, which causes pretraining incompatibility and unstable fine-tuning. Method: This paper proposes LUViT, a framework enabling *co-pretraining* and *bidirectional alignment* of ViTs and LLMs under the masked autoencoding (MAE) objective. It pretrains a ViT backbone via MAE while jointly optimizing the LLM parameters using low-rank adaptation (LoRA), moving beyond the conventional paradigm of treating LLMs as frozen prompters in vision tasks. Contribution/Results: LUViT achieves significant performance gains across diverse downstream vision tasks, including image classification, object detection, and semantic segmentation, demonstrating that linguistic knowledge can be efficiently and stably infused into visual representation learning. The framework establishes a new paradigm for unified pretraining of multimodal foundation models, bridging architectural and semantic gaps between vision and language modalities.

📝 Abstract
The integration of Large Language Model (LLM) blocks with Vision Transformers (ViTs) holds immense promise for vision-only tasks by leveraging the rich semantic knowledge and reasoning capabilities of LLMs. However, a fundamental challenge lies in the inherent modality mismatch between text-centric pretraining of LLMs and vision-centric training of ViTs. Direct fusion often fails to fully exploit the LLM's potential and suffers from unstable fine-tuning. As a result, LLM blocks are typically kept frozen while only the vision components are learned. As a remedy to these challenges, we introduce Language-Unlocked Vision Transformers (LUViT), a novel approach that bridges this modality mismatch through a synergistic pre-training strategy. LUViT co-adapts a ViT backbone and an LLM fusion block by (1) employing Masked Auto-Encoding (MAE) to pre-train the ViT for richer visual representations, and (2) concurrently training Low-Rank Adaptation (LoRA) layers within the LLM block using the MAE objective. This joint optimization guides the ViT to produce LLM-aligned features and the LLM to effectively interpret visual information. We demonstrate through extensive experiments that LUViT significantly improves performance on various downstream vision tasks, showcasing a more effective and efficient pathway to harness LLM knowledge for visual understanding.
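The abstract describes training LoRA layers inside otherwise-frozen LLM blocks. A minimal sketch of one such layer is below; the class name, rank, and scaling hyperparameters are illustrative choices following the standard LoRA recipe, not details taken from the LUViT paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update.

    Hypothetical sketch: wraps a pretrained nn.Linear (e.g. an LLM block's
    projection) so only the low-rank factors receive gradients.
    """
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # keep pretrained LLM weights frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: A projects down to `rank`, B projects back up.
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as an identity (zero) update
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
```

Because the up-projection is zero-initialized, the wrapped layer initially behaves exactly like the frozen base layer, which is what makes this kind of adaptation stable to fine-tune.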
Problem

Research questions and friction points this paper is trying to address.

Bridging modality mismatch between LLMs and ViTs
Enabling effective LLM-ViT fusion for vision tasks
Improving visual representation alignment with LLM knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates LLM blocks with Vision Transformers
Uses Masked Auto-Encoding for ViT pre-training
Employs Low-Rank Adaptation in LLM blocks
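The MAE pre-training these bullets refer to masks most patch tokens and reconstructs them, computing the loss only on masked positions. A minimal sketch of that masking and loss is below, following the standard MAE recipe (mask ratio, shuffle-based sampling); it is not a confirmed implementation detail of this paper.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking: keep a random subset of patch tokens.

    Returns the visible tokens and a binary mask (1 = masked, 0 = visible).
    """
    b, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]  # indices of visible patches
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, device=tokens.device)
    mask.scatter_(1, keep_idx, 0.0)
    return visible, mask

def masked_reconstruction_loss(pred: torch.Tensor,
                               target: torch.Tensor,
                               mask: torch.Tensor) -> torch.Tensor:
    """Mean squared error over masked patches only, as in MAE."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum()
```

In the scheme the summary describes, gradients from this single reconstruction loss would flow into both the ViT backbone and the LoRA factors inside the LLM block, which is what couples the two components during pre-training.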