🤖 AI Summary
The cross-modal similarity prediction mechanism in dual-encoder models (e.g., CLIP) remains opaque, particularly regarding fine-grained interactions between image regions and text tokens.
Method: We propose the first second-order feature-pair attribution method for dual encoders, grounded in differentiable second-order Taylor expansions to quantify interaction importance between image patches and text spans.
Contribution/Results: Our analysis reveals that similarity predictions rely predominantly on cross-modal feature coupling, not on unimodal feature contributions, and that this coupling exhibits strong class dependence and out-of-distribution sensitivity. By clustering error patterns, we identify three canonical failure modes (insufficient object coverage, anomalous scenes, and contextual confusion), which enables interpretable localization of individual prediction errors. This work establishes a new paradigm for explainability in dual-encoder models and provides a reproducible analytical toolkit for rigorous, fine-grained attribution.
📝 Abstract
Dual encoder architectures such as CLIP map two types of inputs into a shared embedding space and predict similarities between them. Despite their success, it is not understood how these models compare their two inputs. Common first-order feature-attribution methods can provide only limited insight into dual encoders, since their predictions depend on feature interactions rather than on individual features. In this paper, we first derive a second-order method that attributes the predictions of any differentiable dual encoder onto feature interactions between its inputs. Second, we apply our method to CLIP models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. This visual-linguistic grounding ability, however, varies heavily between object classes and exhibits pronounced out-of-domain effects. We can identify individual errors as well as systematic failure categories, including insufficient object coverage, unusual scenes, and correlated contexts.
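The idea of attributing a similarity score onto feature pairs can be illustrated on a toy model. The sketch below is not the paper's implementation: it assumes a hypothetical dual encoder with linear encoders `f(x) = Wx` and `g(t) = Vt` and dot-product similarity `s(x, t) = f(x)·g(t)`. For such a bilinear model, the second-order Taylor expansion is exact, and the attribution of the feature pair `(x_i, t_j)` is the mixed partial derivative `∂²s/∂x_i∂t_j`, which here equals `(WᵀV)_ij`. We estimate it numerically with central finite differences and check it against the closed form.

```python
# Hedged toy sketch (not the paper's method): second-order feature-pair
# attribution for a hypothetical linear dual encoder with dot-product
# similarity s(x, t) = (W x) · (V t).

def matvec(M, v):
    # Plain matrix-vector product over nested lists.
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def similarity(W, V, x, t):
    fx, gt = matvec(W, x), matvec(V, t)
    return sum(a * b for a, b in zip(fx, gt))

def pair_attribution(W, V, x, t, i, j, eps=1e-4):
    """Mixed partial d^2 s / dx_i dt_j via central finite differences."""
    def s(dx, dt):
        x2, t2 = x[:], t[:]
        x2[i] += dx
        t2[j] += dt
        return similarity(W, V, x2, t2)
    return (s(eps, eps) - s(eps, -eps) - s(-eps, eps) + s(-eps, -eps)) / (4 * eps * eps)

# Toy encoder weights and inputs (illustrative values only).
W = [[1.0, 2.0], [0.0, 1.0]]   # "image" encoder
V = [[0.5, 0.0], [1.0, 1.0]]   # "text" encoder
x, t = [0.3, -0.7], [1.2, 0.4]

# Analytic check: for this bilinear model, d^2 s / dx_i dt_j = (W^T V)_ij.
WtV = [[sum(W[k][i] * V[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
print(pair_attribution(W, V, x, t, 0, 0), WtV[0][0])
```

For a real CLIP model the similarity is not bilinear in the raw inputs, so the mixed partials vary with the inputs and would be computed by automatic differentiation (e.g. twice-nested gradients) over image patches and text tokens; the closed-form check above only holds for this linear toy.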