CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Natural image-text data suffer from loose semantic alignment due to weak supervision, whereas medical data exhibit tight alignment but limited diversity; both properties impede the robustness and generalization of CLIP models. To address this, the authors propose CLIPin, a plug-and-play non-contrastive plug-in framework that strengthens semantic alignment without modifying the backbone architecture or adding significant parameter overhead. CLIPin integrates contrastive and non-contrastive objectives through shared pre-projectors for the image and text modalities and a unified non-contrastive learning module. It is compatible with CLIP-style variants and requires no retraining of the visual or language encoders. Extensive experiments across diverse downstream tasks, including cross-domain retrieval, zero-shot classification, and medical image-text matching, show consistent and significant performance gains, supporting CLIPin's generality, effectiveness, and practicality.
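The combined objective described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: it assumes a standard symmetric InfoNCE contrastive term plus a SimSiam-style negative-cosine non-contrastive term over matched pairs, combined with a hypothetical weight `lam` (all function names and the weighting are illustrative assumptions).

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix
    labels = np.arange(len(img))                 # matched pairs on the diagonal
    def ce(l):
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))     # image->text and text->image

def non_contrastive_loss(img, txt):
    """Negative cosine similarity between matched pairs only (no negatives);
    in BYOL/SimSiam-style training one side would be behind a stop-gradient."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    return -(img * txt).sum(axis=1).mean()

def combined_loss(img, txt, lam=0.5):
    # Hypothetical weighting; the paper's actual formulation may differ.
    return contrastive_loss(img, txt) + lam * non_contrastive_loss(img, txt)

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 32))
txt = img + 0.1 * rng.normal(size=(8, 32))       # loosely aligned pairs
loss = combined_loss(img, txt)
```

The non-contrastive term supplies dense pairwise supervision that does not depend on the quality of in-batch negatives, which is the kind of "stronger supervision" the summary attributes to CLIPin.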

📝 Abstract
Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model's ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at https://github.com/T6Yang/CLIPin.
Problem

Research questions and friction points this paper is trying to address.

Improving multimodal semantic alignment in CLIP
Addressing weak supervision in image-text datasets
Enhancing representation robustness and generalizability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Non-contrastive plug-in for CLIP alignment
Shared pre-projectors for multimodal integration
Parameter-compromise fusion of contrastive and non-contrastive learning
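The "shared pre-projector" idea in the bullets above can be sketched structurally. In this hypothetical NumPy sketch (all dimensions, names, and the ReLU-MLP form are assumptions, not the paper's specification), frozen encoder features for each modality pass through one shared pre-projector whose output feeds both the contrastive and the non-contrastive heads, so the two objectives share parameters rather than each owning a full projection stack:

```python
import numpy as np

rng = np.random.default_rng(1)

def linear(dim_in, dim_out):
    # Scaled Gaussian init; weights stand in for learned parameters.
    return rng.normal(scale=dim_in ** -0.5, size=(dim_in, dim_out))

# Hypothetical shapes: frozen encoders emit 512-d features per modality.
W_shared_img  = linear(512, 256)   # shared pre-projector, image side
W_shared_txt  = linear(512, 256)   # shared pre-projector, text side
W_contrast    = linear(256, 128)   # head for the contrastive branch
W_noncontrast = linear(256, 128)   # head for the non-contrastive branch

def project(feat, W_shared, W_head):
    # Shared pre-projector -> ReLU -> branch-specific head.
    return np.maximum(feat @ W_shared, 0.0) @ W_head

img_feat = rng.normal(size=(4, 512))
z_c = project(img_feat, W_shared_img, W_contrast)     # contrastive branch
z_n = project(img_feat, W_shared_img, W_noncontrast)  # non-contrastive branch
```

Because `W_shared_img` appears in both branches, gradients from either objective update the same pre-projector, which is one plausible reading of the abstract's "parameter-compromise" integration.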