AI Summary
Existing vision-language models (e.g., CLIP) face challenges in few-shot adaptation to fine-grained domains, including heavy reliance on prompt engineering or full-model fine-tuning, and instability or catastrophic forgetting induced by auxiliary modules. To address these issues, we propose CLIP-SVD, a parameter-efficient multimodal adaptation method based on Singular Value Decomposition (SVD). CLIP-SVD is the first to apply SVD directly to CLIP's weight matrices, optimizing only the singular values (0.04% of the total parameters). This enables cross-domain adaptation without introducing new components and fully preserves pretrained knowledge, while a natural language-based analysis enhances interpretability. Extensive experiments across 11 natural-image and 10 biomedical datasets demonstrate that CLIP-SVD significantly outperforms state-of-the-art methods in accuracy, generalization, and adaptation efficiency.
Abstract
Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present **CLIP-SVD**, a novel *multi-modal* and *parameter-efficient* adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only **0.04%** of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
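The core idea, fine-tuning only the singular values of a pretrained weight matrix while freezing its singular vectors, can be sketched in a few lines. This is a minimal NumPy illustration of the general technique (not the authors' implementation); the matrix size and the learned update `delta` are hypothetical placeholders for what training would produce:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))            # stands in for a pretrained CLIP weight matrix
U, S, Vt = np.linalg.svd(W, full_matrices=False)  # W = U @ diag(S) @ Vt

# Only S is trainable; U and Vt stay frozen. A hypothetical learned update:
delta = 0.1 * rng.standard_normal(S.shape)
W_adapted = U @ np.diag(S + delta) @ Vt    # rescaled basis vectors

# With a zero update, the pretrained weights are recovered exactly,
# so pretrained knowledge is preserved by construction.
assert np.allclose(U @ np.diag(S) @ Vt, W)

# Trainable vs. total parameters: 8 singular values vs. 64 matrix entries.
print(S.size, W.size)
```

The parameter count scales as min(m, n) per m-by-n matrix rather than m*n, which is why tuning singular values across all of CLIP's weight matrices touches only a tiny fraction (0.04%) of the model's parameters.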