Finetuning CLIP to Reason about Pairwise Differences

📅 2024-09-15
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Vision-language models like CLIP align image and text embeddings, but their embedding space lacks semantic comparability and analogical structure, hindering vectorized reasoning about differences between images. Method: We propose a contrastive fine-tuning framework that explicitly aligns image-embedding differences with LLM-generated textual difference descriptors (e.g., "thinner", "brighter"), enabling pairwise image-difference reasoning natively within CLIP's embedding space. We further introduce a "comparative prompting" inference paradigm that leverages text descriptions of differences between classes of interest. Contribution/Results: After fine-tuning on synthetic difference data, our method improves average accuracy by 3.2% across attribute ranking, zero-shot classification, and retrieval tasks. It also significantly enhances linear analogy preservation and directional consistency in the embedding space, providing stronger geometric foundations for downstream applications such as text-to-image generation.
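The core training idea in the summary (aligning image-embedding differences with text embeddings of difference descriptions via contrastive learning) can be sketched as a CLIP-style symmetric InfoNCE loss. This is a minimal numpy illustration, not the paper's implementation: the function name, temperature value, and toy embeddings are all assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # L2-normalize vectors along the last axis
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(x, axis=-1):
    # numerically stable log-softmax
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def difference_contrastive_loss(img_a, img_b, diff_text, temperature=0.07):
    """Symmetric InfoNCE loss aligning image-embedding differences
    (img_a - img_b) with text embeddings of the described difference
    (e.g. "thinner", "brighter").  Matched pairs sit on the diagonal."""
    d = l2_normalize(img_a - img_b)      # (batch, dim) difference vectors
    t = l2_normalize(diff_text)          # (batch, dim) difference-text embeddings
    logits = d @ t.T / temperature       # (batch, batch) similarity matrix
    diag = np.arange(len(d))
    loss_d2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2d = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_d2t + loss_t2d)

# toy batch: when each text embedding equals the corresponding image
# difference, the loss is far lower than for unrelated text embeddings
rng = np.random.default_rng(0)
a, b = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
aligned_loss = difference_contrastive_loss(a, b, a - b)
random_loss = difference_contrastive_loss(a, b, rng.normal(size=(8, 64)))
```

In a real setup the embeddings would come from CLIP's image encoder and text encoder, with gradients flowing through the loss during fine-tuning.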

📝 Abstract
Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is that the resulting embedding space seems to lack some of the structure of its purely text-based alternatives. For instance, text embeddings have long been noted to satisfy *analogies* in embedding space using vector arithmetic, but CLIP has no such property. In this paper, we propose an approach to natively train CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that differences in image embedding space correspond to *text descriptions of the image differences*, which we synthetically generate with large language models on image-caption paired datasets. We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute (e.g., elephants are larger than cats), which is useful in retrieval or in constructing attribute-based classifiers, as well as improved zero-shot classification performance on many downstream image classification tasks. In addition, our approach enables a new mechanism for inference that we refer to as comparative prompting, where we leverage prior knowledge of text descriptions of differences between classes of interest, achieving even larger performance gains in classification. Finally, we illustrate that the resulting embeddings obey a larger degree of geometric properties in embedding space, such as in text-to-image generation.
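The attribute-ranking capability the abstract describes (e.g., elephants are larger than cats) amounts to projecting image embeddings onto a text-derived attribute direction and sorting. The sketch below uses hand-built 3-d vectors in place of real CLIP embeddings; the function name and the toy "size axis" are assumptions for illustration.

```python
import numpy as np

def rank_by_attribute(image_embs, attr_text_emb):
    """Order images along a textual attribute direction (e.g. the text
    embedding of "larger"): project each image embedding onto the
    attribute axis and sort by the signed projection."""
    v = attr_text_emb / np.linalg.norm(attr_text_emb)
    scores = image_embs @ v          # signed projection per image
    return np.argsort(scores)        # indices from least to most of the attribute

# toy example with a hypothetical "size" direction along dimension 0
size_axis = np.array([1.0, 0.0, 0.0])
images = np.array([
    [0.9, 0.1, 0.2],   # stand-in for "elephant"
    [0.1, 0.8, 0.3],   # stand-in for "cat"
    [0.5, 0.2, 0.7],   # stand-in for "dog"
])
order = rank_by_attribute(images, size_axis)   # cat < dog < elephant
```

With actual CLIP embeddings, `attr_text_emb` would be the encoded attribute prompt, and the same projection doubles as a score for attribute-based retrieval or classifiers.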
Problem

Research questions and friction points this paper is trying to address.

Enhance CLIP's ability to reason about image differences
Improve zero-shot classification via comparative prompting
Establish geometric properties in CLIP's embedding space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Finetune CLIP for difference reasoning
Use synthetic data with LLMs
Enable comparative prompting mechanism
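The comparative prompting mechanism leverages text descriptions of differences between classes at inference time. The listing does not spell out the decision rule, so the following is a hypothetical voting-based sketch: every name, the prototype representation, and the sign-based vote are assumptions, not the paper's method.

```python
import numpy as np

def comparative_classify(img_emb, class_protos, diff_texts):
    """Pick a class by pairwise comparative prompts.  diff_texts[i][j]
    holds the text embedding describing how class i differs from class j;
    class i earns a vote whenever the image's offset from prototype j
    points along that difference direction.  Most votes wins."""
    n = len(class_protos)
    votes = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            direction = diff_texts[i][j] / np.linalg.norm(diff_texts[i][j])
            if (img_emb - class_protos[j]) @ direction > 0:
                votes[i] += 1
    return int(np.argmax(votes))

# toy 2-class setup with orthogonal prototypes and exact difference texts
protos = np.array([[1.0, 0.0], [0.0, 1.0]])
diffs = {0: {1: protos[0] - protos[1]}, 1: {0: protos[1] - protos[0]}}
pred_a = comparative_classify(np.array([1.2, -0.1]), protos, diffs)  # class 0
pred_b = comparative_classify(np.array([-0.1, 1.1]), protos, diffs)  # class 1
```

In practice the difference texts would be LLM-generated descriptions (as in the paper's training data) encoded by the fine-tuned CLIP text encoder, rather than exact prototype differences.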
Dylan Sam
PhD Student, Carnegie Mellon University
Machine Learning
Devin Willmott
Bosch Center for AI
João D. Semedo
Bosch Center for AI
J. Zico Kolter
Carnegie Mellon University