AI Summary
To address the challenge of simultaneously achieving semantic alignment and fine-grained visual perception in transductive zero-shot learning, this paper proposes OTFusion, a training-free framework that is the first to introduce Optimal Transport (OT) for cross-model feature fusion. OTFusion performs unsupervised alignment between the output distributions of CLIP (a vision-language foundation model) and DINOv2 (a self-supervised vision foundation model), without fine-tuning or auxiliary annotations. By jointly modeling class-level semantic priors and instance-level visual discriminability, it bridges modality gaps at the distribution level. Evaluated on 11 benchmark datasets, OTFusion achieves an average accuracy gain of nearly 10% over the vanilla CLIP baseline, significantly outperforming existing training-free methods. This work establishes a general, efficient, and distribution-aware fusion paradigm for cross-modal zero-shot learning.
Abstract
Transductive zero-shot learning (ZSL) aims to classify unseen categories by leveraging both semantic class descriptions and the distribution of unlabeled test data. While Vision-Language Models (VLMs) such as CLIP excel at aligning visual inputs with textual semantics, they often rely too heavily on class-level priors and fail to capture fine-grained visual cues. In contrast, Vision-only Foundation Models (VFMs) like DINOv2 provide rich perceptual features but lack semantic alignment. To exploit the complementary strengths of these models, we propose OTFusion, a simple yet effective training-free framework that bridges VLMs and VFMs via Optimal Transport. Specifically, OTFusion learns a shared probabilistic representation that aligns visual and semantic information by minimizing the transport cost between their respective distributions. This unified distribution enables coherent class predictions that are both semantically meaningful and visually grounded. Extensive experiments on 11 benchmark datasets demonstrate that OTFusion consistently outperforms the original CLIP model, achieving an average accuracy improvement of nearly $10\%$, all without any fine-tuning or additional annotations. The code will be publicly released after the paper is accepted.
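To make the distribution-level fusion concrete, here is a minimal sketch of entropy-regularized optimal transport (Sinkhorn-Knopp) applied to fused image-to-class scores. The shapes, the convex combination of the CLIP and DINOv2 score matrices, and the uniform marginals are illustrative assumptions for this toy example, not the paper's actual implementation.

```python
import numpy as np

def sinkhorn(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized OT (Sinkhorn-Knopp): returns a transport plan P
    with (near-)uniform marginals minimizing <P, cost> - epsilon * H(P)."""
    n, m = cost.shape
    K = np.exp(-cost / epsilon)            # Gibbs kernel
    r = np.full(n, 1.0 / n)                # uniform marginal over test images
    c = np.full(m, 1.0 / m)                # uniform marginal over classes (assumption)
    v = np.ones(m)
    for _ in range(n_iters):
        u = r / (K @ v)                    # scale rows toward marginal r
        v = c / (K.T @ u)                  # scale columns toward marginal c
    return u[:, None] * K * v[None, :]     # P = diag(u) K diag(v)

# Toy score matrices (hypothetical values standing in for model outputs):
rng = np.random.default_rng(0)
clip_sim = rng.random((8, 3))              # CLIP image-text similarities: 8 images x 3 classes
dino_sim = rng.random((8, 3))              # class scores induced from DINOv2 visual affinities

fused = 0.5 * clip_sim + 0.5 * dino_sim    # simple convex combination of the two distributions
cost = 1.0 - fused                         # higher similarity -> lower transport cost
P = sinkhorn(cost)                         # soft assignment of images to classes
pred = P.argmax(axis=1)                    # predicted class = dominant mass in each row
```

Because the transport plan must respect both marginals, each image's assignment is influenced by the whole test batch rather than by its own scores alone, which is what gives the transductive setting its leverage.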