Bayesian Natural Gradient Fine-Tuning of CLIP Models via Kalman Filtering

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the weak in-distribution (ID) and out-of-distribution (OOD) generalization of vision-language models like CLIP under few-shot fine-tuning, this paper proposes Kalman Filter-based Bayesian Natural Gradient fine-tuning (KF-BNG). KF-BNG is the first method to integrate Kalman filtering into CLIP fine-tuning, enabling online Bayesian inference to dynamically update parameter posterior distributions. It employs a low-rank approximation of the Fisher information matrix for efficient second-order optimization and inherently yields predictive uncertainty estimates. On multiple image classification benchmarks, KF-BNG matches or exceeds state-of-the-art few-shot methods in ID accuracy while substantially improving OOD robustness—achieving average OOD accuracy gains of 3.2–7.8 percentage points. The approach establishes a new paradigm for few-shot vision-language modeling that jointly delivers computational efficiency, distributional robustness, and interpretability through principled uncertainty quantification.
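The core mechanism in the summary, online Bayesian inference that updates a parameter posterior one observation at a time, can be illustrated with a toy linear-Gaussian Kalman filter. This is a sketch only: KF-BNG operates on CLIP parameters with a low-rank Fisher approximation, whereas the linear observation model `y = H @ theta + noise`, the noise scales `R` and `Q`, and the function name below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Toy linear-Gaussian Kalman filter treating model weights `theta` as
# the latent state. Each call does one predict + update cycle on a
# scalar observation y ≈ H @ theta, returning the posterior mean and
# covariance. The covariance P is the source of the predictive
# uncertainty estimates the summary mentions.
def kalman_update(theta, P, H, y, R=1.0, Q=1e-4):
    """One predict+update step for a constant-state Kalman filter."""
    P = P + Q * np.eye(len(theta))       # predict: inflate covariance by process noise
    S = H @ P @ H.T + R                  # innovation variance (scalar observation)
    K = P @ H.T / S                      # Kalman gain, shape (d,)
    theta = theta + K * (y - H @ theta)  # correct the posterior mean
    P = P - np.outer(K, H @ P)           # shrink the posterior covariance
    return theta, P
```

Repeating this update over a stream of observations moves the mean toward the data-consistent parameters while the covariance contracts, which is the "dynamically update parameter posterior distributions" behavior attributed to KF-BNG.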

📝 Abstract
Vision-language pre-trained models, such as CLIP, have established new benchmarks in multimodal data mining. For such models, achieving optimal performance on both in-distribution (ID) and out-of-distribution (OOD) datasets through few-shot fine-tuning remains a major challenge, especially when labeled data is scarce. Most existing fine-tuning approaches rely on first-order gradient-based optimizers, which typically suffer from slow convergence, sensitivity to step-size hyperparameters, and poor generalization in OOD settings. In contrast, second-order methods use local curvature information of the loss landscape to adjust the update step size. This is particularly beneficial for CLIP models, whose non-convex loss functions often contain sharp critical points. In such cases, the natural gradient direction can offer more substantial and efficient per-iteration updates when fine-tuning with limited data. Natural Gradient Descent (NGD) is obtained by preconditioning the standard gradient with the inverse Fisher Information Matrix (FIM), which is computationally expensive for large models. To address this, we propose a Bayesian approximation of NGD using a Kalman filter for CLIP models. Our method combines the benefits of second-order optimization with Bayesian inference, enhancing generalization while providing uncertainty quantification. Extensive experiments on diverse image classification datasets demonstrate that our algorithm consistently achieves superior or comparable ID performance and improved OOD robustness compared to state-of-the-art baselines. To the best of our knowledge, this work represents the first successful application of Kalman filtering to fine-tuning CLIP-based models, enabling more robust and efficient learning in vision-language tasks.
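The NGD preconditioning described in the abstract, the gradient rescaled by the inverse FIM, can be sketched for a generic parameter vector. The empirical-Fisher construction from per-sample gradients, the damping term, and the function name below are illustrative assumptions; the paper's contribution is precisely to avoid this explicit inversion via a Kalman-filter-based Bayesian approximation.

```python
import numpy as np

# Generic natural-gradient step on a hypothetical parameter vector.
# The Fisher Information Matrix is approximated by the empirical Fisher
# F ≈ (1/n) Σ g_i g_iᵀ built from per-sample gradients g_i, with
# Tikhonov damping so the linear solve is well posed. For a d-parameter
# model this costs O(d^3) per step, which is why the abstract calls the
# exact inverse computationally expensive for large models.
def natural_gradient_step(theta, per_sample_grads, lr=0.1, damping=1e-3):
    """One NGD update: theta <- theta - lr * F^{-1} @ mean_grad."""
    G = np.asarray(per_sample_grads)        # shape (n, d)
    mean_grad = G.mean(axis=0)
    fisher = G.T @ G / G.shape[0]           # empirical FIM, shape (d, d)
    fisher += damping * np.eye(G.shape[1])  # damping for invertibility
    return theta - lr * np.linalg.solve(fisher, mean_grad)
```

Compared with a plain gradient step `theta - lr * mean_grad`, the solve rescales the update by local curvature, which is the per-iteration advantage the abstract claims for fine-tuning with limited data.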
Problem

Research questions and friction points this paper is trying to address.

Fine-tuning CLIP models with limited labeled data for optimal performance
Addressing slow convergence and poor generalization in first-order optimization methods
Reducing computational complexity of Natural Gradient Descent for large models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian approximation of Natural Gradient Descent
First application of Kalman filtering to CLIP fine-tuning
Combines second-order optimization with Bayesian inference