CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally

📅 2025-02-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
CLIP exhibits a “bag-of-words” (BoW) deficiency in compositional concept understanding: during cross-modal alignment, it fails to reliably bind attributes to their corresponding objects. Crucially, this limitation does not stem from insufficient binding information in unimodal representations, but rather from the cosine-similarity-based cross-modal alignment mechanism itself. To address this, we propose LABCLIP, which introduces a lightweight linear projection layer after the text encoder to recalibrate the text embedding space and explicitly enhance attribute–object binding. This is the first work to identify the root cause of CLIP’s BoW behavior as intrinsic to the cross-modal alignment mechanism—not the unimodal encoders. Experiments demonstrate that LABCLIP significantly improves attribute-binding accuracy in multi-object scenes on benchmarks including COCO-SCE and Visual7W, while also substantially boosting compositional zero-shot inference performance.

📝 Abstract
CLIP (Contrastive Language-Image Pretraining) has become a popular choice for various downstream tasks. However, recent studies have questioned its ability to represent compositional concepts effectively. These works suggest that CLIP often acts like a bag-of-words (BoW) model, interpreting images and text as sets of individual concepts without grasping the structural relationships. In particular, CLIP struggles to correctly bind attributes to their corresponding objects when multiple objects are present in an image or text. In this work, we investigate why CLIP exhibits this BoW-like behavior. We find that the correct attribute-object binding information is already present in individual text and image modalities. Instead, the issue lies in the cross-modal alignment, which relies on cosine similarity. To address this, we propose Linear Attribute Binding CLIP or LABCLIP. It applies a linear transformation to text embeddings before computing cosine similarity. This approach significantly improves CLIP's ability to bind attributes to correct objects, thereby enhancing its compositional understanding.
Problem

Research questions and friction points this paper is trying to address.

CLIP fails to bind attributes to the correct objects when multiple objects appear in an image or caption
The failure originates in cosine-similarity-based cross-modal alignment, not in the unimodal encoders
Open question: can binding be recovered with a lightweight fix, without retraining CLIP's encoders?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learned linear transformation applied to text embeddings before computing cosine similarity (LABCLIP)
Recalibrated cross-modal alignment that preserves attribute-object binding information
Improved attribute-object binding and compositional zero-shot performance without modifying the encoders
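The core idea above is small enough to sketch: instead of scoring an image-text pair with plain cosine similarity, LABCLIP first passes the text embedding through a learned linear map and scores the transformed embedding. A minimal dependency-free illustration follows; the toy 3-d vectors and the matrix `W` are hypothetical stand-ins (real CLIP embeddings are 512-d or larger, and `W` would be trained, not hand-picked).

```python
import math

def cosine(u, v):
    # standard cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def linear_map(W, t):
    # apply a learned linear transformation W (d x d) to a text embedding t
    return [sum(W[i][j] * t[j] for j in range(len(t))) for i in range(len(W))]

# hypothetical toy embeddings, for illustration only
image_emb = [0.2, 0.9, 0.1]
text_emb = [0.5, 0.7, 0.3]
W = [[1.0, 0.1, 0.0],   # in LABCLIP this matrix is learned so that the
     [0.0, 1.0, 0.2],   # transformed text space keeps attribute-object
     [0.0, 0.0, 1.0]]   # binding distinctions that plain cosine collapses

score_plain = cosine(image_emb, text_emb)               # standard CLIP-style score
score_lab = cosine(image_emb, linear_map(W, text_emb))  # LABCLIP-style score
```

The appeal of this design, as the abstract frames it, is that the frozen encoders are untouched: only the alignment step changes, which is cheap to train and deploy.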
🔎 Similar Papers
No similar papers found.