CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance

📅 2024-12-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address weak cross-modal alignment in lightweight vision-language models trained under resource constraints, particularly when relying on a single image-text contrastive objective, this paper proposes CLIP-PING: Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance. Rather than adding parameters or demanding large-scale data, CLIP-PING bootstraps features from arbitrary pre-trained unimodal encoders to self-generate intrinsic supervision signals, constructing semantically rich positive pairs via nearest-neighbor (NN) and cross nearest-neighbor (XNN) retrieval. Trained on 3 million (image, text) pairs with a lightweight ViT-XS image encoder and a text encoder, CLIP-PING improves zero-shot ImageNet-1K classification by 5.5% over the original CLIP and raises Flickr30K image-to-text and text-to-image retrieval Recall@1 by 10.7% and 5.7%, respectively; linear-probe transfer to downstream tasks also clearly surpasses the baseline. The core contribution is a simple, self-supervised intrinsic-neighbor guidance mechanism that strengthens cross-modal alignment with minimal computational overhead and lower data demands.
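The mechanism summarized above is, in essence, nearest-neighbor-guided contrastive learning. The sketch below is not the authors' code; it is a minimal PyTorch illustration of the general idea, assuming all features are projected into a shared, L2-normalized embedding space and that the support banks hold index-aligned (image, text) feature pairs. Names such as `clip_ping_style_loss`, `img_bank`, `txt_bank`, and the weights `w_nn`/`w_xnn` are illustrative placeholders, not the paper's notation.

```python
# Minimal sketch of neighbor-guided contrastive training (not the authors' code).
# Assumes every embedding shares one dimension and bank features are L2-normalized.
import torch
import torch.nn.functional as F

def info_nce(query, key, temperature=0.07):
    """InfoNCE: the i-th query should match the i-th key; other keys are negatives."""
    logits = query @ key.t() / temperature
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)

def nn_indices(features, bank):
    """Index of the most similar bank entry for each (normalized) feature."""
    return (features @ bank.t()).argmax(dim=-1)

def clip_ping_style_loss(img_emb, txt_emb,        # outputs of the trained lightweight encoders
                         frozen_img, frozen_txt,  # features from frozen pre-trained unimodal encoders
                         img_bank, txt_bank,      # index-aligned support banks of frozen features
                         w_nn=0.5, w_xnn=0.5):    # illustrative loss weights
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # 1) Standard CLIP image-text contrastive term.
    loss_clip = 0.5 * (info_nce(img_emb, txt_emb) + info_nce(txt_emb, img_emb))

    # Look up each sample's nearest neighbor in the frozen support banks.
    i_idx = nn_indices(F.normalize(frozen_img, dim=-1), img_bank)
    t_idx = nn_indices(F.normalize(frozen_txt, dim=-1), txt_bank)

    # 2) NN term: the same-modality neighbor serves as an extra positive.
    loss_nn = 0.5 * (info_nce(img_emb, img_bank[i_idx]) + info_nce(txt_emb, txt_bank[t_idx]))

    # 3) XNN term: the cross-modal counterpart of that neighbor is also treated as a positive.
    loss_xnn = 0.5 * (info_nce(img_emb, txt_bank[i_idx]) + info_nce(txt_emb, img_bank[t_idx]))

    return loss_clip + w_nn * loss_nn + w_xnn * loss_xnn

# Toy usage with random tensors standing in for encoder outputs.
B, D, N = 8, 256, 1024
loss = clip_ping_style_loss(
    torch.randn(B, D), torch.randn(B, D),
    torch.randn(B, D), torch.randn(B, D),
    F.normalize(torch.randn(N, D), dim=-1),
    F.normalize(torch.randn(N, D), dim=-1),
)
```

The extra NN/XNN terms broaden what counts as a positive pair beyond the exact (image, caption) match, which is one plausible reading of how the neighbor guidance injects additional semantic diversity into the contrastive signal.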

📝 Abstract
Beyond the success of Contrastive Language-Image Pre-training (CLIP), recent trends mark a shift toward exploring the applicability of lightweight vision-language models for resource-constrained scenarios. These models often deliver suboptimal performance when relying solely on a single image-text contrastive learning objective, spotlighting the need for more effective training mechanisms that guarantee robust cross-modal feature alignment. In this work, we propose CLIP-PING: Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance, a novel yet simple and efficient training paradigm designed to boost the performance of lightweight vision-language models with minimal computational overhead and lower data demands. CLIP-PING bootstraps unimodal features extracted from arbitrary pre-trained encoders to obtain intrinsic guidance of proximus neighbor samples, i.e., nearest-neighbor (NN) and cross nearest-neighbor (XNN). We find that extra contrastive supervision from these neighbors substantially boosts cross-modal alignment, enabling lightweight models to learn more generic features with rich semantic diversity. Extensive experiments reveal that CLIP-PING notably surpasses its peers in zero-shot generalization and cross-modal retrieval tasks. Specifically, a 5.5% gain on zero-shot ImageNet1K classification with 10.7% (I2T) and 5.7% (T2I) on Flickr30K retrieval, compared to the original CLIP when using ViT-XS image encoder trained on 3 million (image, text) pairs. Moreover, CLIP-PING showcases a strong transferability under the linear evaluation protocol across several downstream tasks.
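Read as an objective, the abstract suggests a combined loss of the general form below; the weights $\lambda_{\mathrm{NN}}$ and $\lambda_{\mathrm{XNN}}$ and the precise neighbor definitions are illustrative assumptions, not values taken from the paper.

$$
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CLIP}} \;+\; \lambda_{\mathrm{NN}}\,\mathcal{L}_{\mathrm{NN}} \;+\; \lambda_{\mathrm{XNN}}\,\mathcal{L}_{\mathrm{XNN}},
$$

where $\mathcal{L}_{\mathrm{CLIP}}$ is the standard symmetric image-text InfoNCE loss, and $\mathcal{L}_{\mathrm{NN}}$, $\mathcal{L}_{\mathrm{XNN}}$ are contrastive terms whose positives are nearest-neighbor and cross nearest-neighbor features retrieved from support banks built with the frozen pre-trained unimodal encoders.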
Problem

Research questions and friction points this paper is trying to address.

Enhance the performance of lightweight vision-language models
Improve cross-modal feature alignment with minimal computational and data resources
Boost zero-shot generalization and cross-modal retrieval performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Proximus Intrinsic Neighbors Guidance (PING) derived from pre-trained unimodal encoders
Adds nearest-neighbor (NN) and cross nearest-neighbor (XNN) contrastive supervision to strengthen cross-modal alignment
Improves zero-shot generalization and cross-modal retrieval with minimal overhead
👥 Authors
Chu Myaet Thwal
Kyung Hee University
Ye Lin Tun
Kyung Hee University
Minh N. H. Nguyen
Vietnam-Korea University of Information and Communication Technology - The University of Danang
Eui-nam Huh
Kyung Hee University
Choong Seon Hong
Kyung Hee University