Scaling Language-Free Visual Representation Learning

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual self-supervised learning (SSL) has long underperformed CLIP on multimodal tasks such as visual question answering (VQA), with the community widely attributing this gap to the necessity of linguistic supervision. Method: We systematically scale pure vision SSL models, up to 7 billion parameters, on the same MetaCLIP data used for the CLIP baselines, controlling data composition and evaluation protocols to isolate the effects of scale and supervision. Contribution/Results: We demonstrate that SSL performance improves consistently with model size without saturating, matching CLIP's accuracy on VQA and classic vision benchmarks. This work provides empirical evidence that large-scale vision-only SSL, without any linguistic supervision, can learn highly generalizable, multimodal-ready representations. It challenges the prevailing assumption that language supervision is indispensable for multimodal representation learning, establishing the feasibility and favorable scaling behavior of language-free visual representation learning.

📝 Abstract
Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced by language supervision, even though visual SSL and CLIP models are often trained on different data. In this work, we ask the question: "Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?" We study this question by training both visual SSL and CLIP models on the same MetaCLIP data, and leveraging VQA as a diverse testbed for vision encoders. In this controlled setup, visual SSL models scale better than CLIP models in terms of data and model capacity, and visual SSL performance does not saturate even after scaling up to 7B parameters. Consequently, we observe visual SSL methods achieve CLIP-level performance on a wide range of VQA and classic vision benchmarks. These findings demonstrate that pure visual SSL can match language-supervised visual pretraining at scale, opening new opportunities for vision-centric representation learning.
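To make the controlled setup in the abstract concrete, the sketch below (not the authors' code) shows the shape of such a comparison: two vision encoders, one standing in for a language-free SSL model and one for a CLIP vision tower, are evaluated through an identical frozen-encoder VQA-style harness, so any performance difference is attributable to the encoder rather than the downstream pipeline. ToyVisionEncoder, VQAHarness, the projector, and the answer head are simplified hypothetical stand-ins for the actual pretrained models and the paper's VQA evaluation pipeline.

```python
# Minimal sketch (assumed names, not the paper's code) of a controlled encoder
# comparison: swap the pretrained vision encoder while keeping the downstream
# VQA-style harness fixed.
import torch
import torch.nn as nn


class ToyVisionEncoder(nn.Module):
    """Stand-in for a ViT backbone (SSL- or CLIP-pretrained in the real study)."""

    def __init__(self, embed_dim: int = 768, num_patches: int = 196):
        super().__init__()
        self.num_patches = num_patches
        self.proj = nn.Linear(3 * 16 * 16, embed_dim)  # patchify + linear embed

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, 224, 224) -> patch tokens: (B, 196, embed_dim)
        b = images.shape[0]
        patches = images.unfold(2, 16, 16).unfold(3, 16, 16)        # (B, 3, 14, 14, 16, 16)
        patches = patches.reshape(b, 3, self.num_patches, 16 * 16)
        patches = patches.permute(0, 2, 1, 3).reshape(b, self.num_patches, -1)
        return self.proj(patches)


class VQAHarness(nn.Module):
    """Frozen vision encoder + trainable projector feeding a toy answer head."""

    def __init__(self, encoder: nn.Module, embed_dim: int = 768,
                 lm_dim: int = 512, num_answers: int = 1000):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():        # freeze: only the adapter is tuned
            p.requires_grad_(False)
        self.projector = nn.Linear(embed_dim, lm_dim)
        self.answer_head = nn.Linear(lm_dim, num_answers)  # stands in for an LLM decoder

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            tokens = self.encoder(images)
        pooled = self.projector(tokens).mean(dim=1)    # pool visual tokens
        return self.answer_head(pooled)                # logits over candidate answers


# Identical harness in both arms; only the pretrained encoder would differ.
ssl_encoder = ToyVisionEncoder()    # would be a language-free SSL model in practice
clip_encoder = ToyVisionEncoder()   # would be a CLIP vision tower trained on the same data

images = torch.randn(2, 3, 224, 224)
for name, enc in [("ssl", ssl_encoder), ("clip", clip_encoder)]:
    logits = VQAHarness(enc)(images)
    print(name, logits.shape)  # torch.Size([2, 1000]) for both arms
```

The point of the design is that data, preprocessing, projector, and answer head are shared across both arms; in the paper's setup the analogous control is training both encoder families on the same MetaCLIP data and evaluating them through the same VQA testbed.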
Problem

Research questions and friction points this paper is trying to address.

Visual SSL underperforms CLIP in multimodal settings such as VQA
Is the gap due to the lack of language supervision or to differences in training data?
Whether vision-only SSL can close the gap given sufficient data and model scale
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual SSL scales better than CLIP in both data and model capacity, without saturating up to 7B parameters
Both model families trained on the same MetaCLIP data for a controlled, fair comparison
Pure visual SSL reaches CLIP-level performance on VQA and classic vision benchmarks