🤖 AI Summary
Vision-language models like CLIP align image and text embeddings, but their embedding space lacks semantic comparability and analogical structure, which hinders vector-based reasoning about differences between images.
Method: We propose a contrastive fine-tuning framework that explicitly aligns image-embedding differences with LLM-generated textual difference descriptors (e.g., “thinner”, “brighter”), enabling native pairwise image-difference reasoning within CLIP’s embedding space. We further introduce a “comparative prompting” inference paradigm that leverages prior knowledge of textual differences between classes.
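The summary above describes aligning image-embedding differences with embeddings of textual difference descriptors via contrastive learning. The paper's exact loss is not given here; the following is a minimal sketch of one plausible InfoNCE-style formulation, assuming pre-computed embeddings and hypothetical names (`difference_contrastive_loss` and its arguments are illustrative, not the paper's API):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def difference_contrastive_loss(img_a, img_b, diff_text, temperature=0.07):
    """InfoNCE-style loss aligning image-embedding differences (img_b - img_a)
    with embeddings of textual difference descriptions (e.g., "thinner").

    img_a, img_b: (N, D) embeddings of image pairs.
    diff_text:    (N, D) embeddings of the textual difference for each pair.
    """
    delta = l2_normalize(img_b - img_a)    # (N, D) difference directions
    text = l2_normalize(diff_text)         # (N, D) difference descriptors
    logits = delta @ text.T / temperature  # (N, N) similarity matrix
    # Each difference should match its own descriptor (diagonal targets),
    # treating the other descriptors in the batch as negatives.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(delta))
    return -log_probs[idx, idx].mean()
```

When the difference vectors point toward their matching descriptors, the diagonal of the similarity matrix dominates and the loss approaches zero; mismatched descriptors raise it.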
Contribution/Results: After fine-tuning on synthetic difference data, our method improves average accuracy by 3.2% across attribute ranking, zero-shot classification, and retrieval tasks. It also markedly improves linear analogy preservation and directional consistency in the embedding space, providing a stronger geometric foundation for downstream applications such as text-to-image generation.
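"Directional consistency" above can be made concrete: if an attribute change (say, small to large) corresponds to a single direction in embedding space, then difference vectors from many image pairs sharing that change should be mutually aligned. A small diagnostic sketch, with a hypothetical metric name (not from the paper):

```python
import numpy as np

def directional_consistency(pairs):
    """Mean pairwise cosine similarity between the difference vectors of
    image-embedding pairs that share the same attribute change.
    Values near 1 indicate the attribute maps to a single direction.

    pairs: iterable of (emb_a, emb_b) arrays of equal dimension.
    """
    deltas = np.stack([b - a for a, b in pairs])
    deltas /= np.linalg.norm(deltas, axis=1, keepdims=True)
    sims = deltas @ deltas.T              # pairwise cosines
    n = len(deltas)
    off_diag = sims[~np.eye(n, dtype=bool)]  # drop self-similarities
    return off_diag.mean()
```

A score near 1 over held-out pairs would indicate the kind of linear, analogy-friendly geometry the results describe.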
📝 Abstract
Vision-language models (VLMs) such as CLIP are trained via contrastive learning on paired images and text, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is that the resulting embedding space seems to lack some of the structure of its purely text-based alternatives. For instance, while text embeddings have long been noted to satisfy *analogies* in embedding space via vector arithmetic, CLIP has no such property. In this paper, we propose an approach to natively train CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that differences in image embedding space correspond to *text descriptions of the image differences*, which we synthetically generate with large language models on image-caption paired datasets. We first demonstrate that our approach yields significantly improved capabilities in ranking images by a given attribute (e.g., elephants are larger than cats), which is useful in retrieval or in constructing attribute-based classifiers, as well as improved zero-shot classification performance on many downstream image classification tasks. In addition, our approach enables a new inference mechanism that we refer to as comparative prompting, in which we leverage prior knowledge of text descriptions of the differences between classes of interest, achieving even larger performance gains in classification. Finally, we illustrate that the resulting embeddings obey a greater degree of geometric structure in embedding space, with benefits for applications such as text-to-image generation.
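The attribute-ranking use case in the abstract (e.g., elephants are larger than cats) can be sketched as projecting image embeddings onto the direction associated with a difference descriptor such as "larger" and sorting by the projection. This is a minimal illustration, assuming embeddings are already computed; the function name and interface are hypothetical:

```python
import numpy as np

def rank_by_attribute(image_embs, attribute_direction):
    """Rank images along a textual attribute direction (e.g., the embedding
    direction for "larger"): a higher projection means more of the attribute.

    image_embs:          (N, D) image embeddings.
    attribute_direction: (D,) direction vector for the attribute.
    Returns indices ordered from most to least of the attribute.
    """
    d = attribute_direction / np.linalg.norm(attribute_direction)
    scores = image_embs @ d          # signed projection onto the attribute
    return np.argsort(-scores)       # descending order
```

In the fine-tuned space described above, such projections should order images consistently with the attribute, which is what makes retrieval and attribute-based classifiers straightforward to build on top.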