Color in Visual-Language Models: CLIP Deficiencies

📅 2024-10-28
🏛️ Color and Imaging Conference
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper reveals a systematic bias in CLIP's color representation: severe under-assignment of achromatic labels (white/gray/black) and overreliance on textual cues at the expense of chromatic visual information. To diagnose this issue rigorously, the authors construct a controllable synthetic dataset, design a contrastive Stroop-effect experiment, and integrate neuron-level activation analysis with cross-modal representation visualization. Their analysis identifies abundant text-selective neurons, concentrated in the network's deeper layers, alongside a striking scarcity of multimodal neurons that genuinely encode color semantics. This work constitutes the first dual-perspective investigation, spanning cognitive mechanisms and neural representations, that explicates CLIP's color-understanding deficits. It provides an interpretable diagnostic framework and actionable pathways for enhancing color perception in vision-language models.
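To make the two probes concrete, here is a minimal sketch of the kind of experiment the summary describes, assuming the HuggingFace transformers CLIP API. The checkpoint, prompt template, label set, and rendering details are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of the two probes described above, assuming the HuggingFace
# `transformers` CLIP API. Checkpoint, prompts, label set, and rendering are
# illustrative assumptions, not the authors' exact experimental setup.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

COLORS = ["red", "green", "blue", "yellow", "white", "gray", "black"]

def color_label_probs(img):
    """Zero-shot colour labelling: softmax over colour-name prompts."""
    prompts = [f"a photo of the color {c}" for c in COLORS]
    inputs = processor(text=prompts, images=img, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, len(COLORS))
    return dict(zip(COLORS, logits.softmax(dim=-1).squeeze(0).tolist()))

def stroop_image(word, ink, size=224):
    """Render a colour word in a (possibly conflicting) ink colour."""
    img = Image.new("RGB", (size, size), "lightgray")
    ImageDraw.Draw(img).text((size // 3, size // 2), word, fill=ink)
    return img

# Probe (a): achromatic bias -- does a plain gray patch get labelled "gray"?
print(color_label_probs(Image.new("RGB", (224, 224), (128, 128, 128))))

# Probe (b): Stroop effect -- if CLIP prioritises the written word over the
# ink colour, the incongruent stimulus should still score highest on "red".
print(color_label_probs(stroop_image("red", ink="red")))
print(color_label_probs(stroop_image("red", ink="blue")))
```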

📝 Abstract
This work explores how color is encoded in CLIP (Contrastive Language-Image Pre-training), currently the most influential VLM (Visual-Language Model) in Artificial Intelligence. After performing different experiments on synthetic datasets created for this task, we conclude that CLIP can attribute correct color labels to colored visual stimuli, but we come across two main deficiencies: (a) a clear bias on achromatic stimuli, which are poorly related to the color concept, so white, gray, and black are rarely assigned as color labels; and (b) a tendency to prioritize text over other visual information, which we prove is highly significant in color labelling through an exhaustive Stroop-effect test. With the aim of finding the causes of these color deficiencies, we analyse the internal representation at the neuron level. We conclude that CLIP presents a substantial number of neurons selective to text, especially in the deepest layers of the network, and a smaller number of multi-modal color neurons, which could be the key to understanding the concept of color properly. Our investigation underscores the necessity of refining color representation mechanisms in neural networks to foster a more comprehensive understanding of colors as humans perceive them, thereby advancing the efficacy and versatility of multimodal models like CLIP in real-world scenarios.
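A minimal sketch of what the neuron-level analysis could look like in practice, assuming the same transformers CLIP API: the layer index, stimuli, and selectivity index below are illustrative assumptions rather than the paper's exact procedure.

```python
# Hypothetical sketch of a neuron-level selectivity analysis: compare
# per-channel activations in a deep CLIP vision layer on text-bearing images
# versus plain colour patches. Layer choice, stimuli, and the selectivity
# index are illustrative assumptions, not the authors' method.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_stimulus(word):
    """A rendered word on a white background (text-bearing stimulus)."""
    img = Image.new("RGB", (224, 224), "white")
    ImageDraw.Draw(img).text((80, 100), word, fill="black")
    return img

def color_stimulus(rgb):
    """A uniform colour patch (purely chromatic stimulus)."""
    return Image.new("RGB", (224, 224), rgb)

def mean_channel_activations(images, layer=-2):
    """Average activation per hidden unit ('neuron') in one vision layer."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        hidden = model.vision_model(
            pixel_values=inputs["pixel_values"], output_hidden_states=True
        ).hidden_states[layer]           # (batch, tokens, hidden_dim)
    return hidden.mean(dim=(0, 1))       # (hidden_dim,) mean over images, patches

text_act = mean_channel_activations([text_stimulus(w) for w in ["red", "blue", "stop"]])
color_act = mean_channel_activations(
    [color_stimulus(c) for c in [(255, 0, 0), (0, 0, 255), (128, 128, 128)]]
)

# Text-selectivity index per neuron: large positive values flag units that
# respond to rendered words much more strongly than to pure colour.
selectivity = (text_act - color_act) / (text_act.abs() + color_act.abs() + 1e-6)
print("Most text-selective units:", torch.topk(selectivity, k=10).indices.tolist())
```

Repeating this for several layer indices would show whether text-selective units concentrate in the deepest layers, as the abstract reports.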
Problem

Research questions and friction points this paper is trying to address.

CLIP color encoding deficiencies
Under-assignment of achromatic labels (white/gray/black)
Text prioritized over chromatic visual information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diagnoses CLIP's color deficiencies with a controllable synthetic dataset and a Stroop-effect test
Identifies text-selective neurons through neuron-level activation analysis
Proposes refining color representation mechanisms
Guillem Arias
Universitat Autònoma de Barcelona
Artificial Intelligence · Machine Learning
Ramón Baldrich
Computer Vision Center / Universitat Autònoma de Barcelona
M. Vanrell
Computer Vision Center / Universitat Autònoma de Barcelona