Formalizing the Binding Problem

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the fundamental challenge of correctly binding features to objects in visual models by formalizing feature binding through an information-theoretic framework and proposing a novel probing method to quantify binding information within the internal representations of Vision Transformers (ViTs). By analyzing the representations of [CLS] and spatial tokens on both multi-object synthetic and natural images, the authors reveal that current ViTs commonly suffer from insufficient binding, with their binding capacity highly dependent on architectural design choices. The work underscores the critical role of binding information in visual recognition and reasoning, offering a new perspective for understanding and improving model representations.

📝 Abstract

Representations of the world arguably contain information about features (e.g., something is blue, something is a circle) but also information about which features are part of the same object (e.g., the circle is blue), which we call binding information. Any system able to understand scenes with multiple objects must solve the binding problem: it needs to know which features belong together. However, although prior work shows that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn the corresponding binding information for features. There is reason to suspect that they learn little of it: misattributing features to the wrong objects is a common failure of ViT-based architectures, especially in scenes whose objects share features. Here we formalize the binding problem with an information-theoretic approach and introduce a probing method to measure binding information in model representations. We perform experiments on ViTs, measuring binding from different components of the architecture, such as the image summary token [CLS] or the spatial tokens. We use datasets with different binding challenges, such as feature sharing, occlusion, and natural features, and compare the performance of several pre-trained ViTs. Overall, our research demonstrates that binding is a key ingredient of strong visual recognition and reasoning.
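The distinction between feature information and binding information can be illustrated with a toy calculation. The sketch below is not the paper's formalization or probing method, which the abstract does not spell out. It is a minimal example under stated assumptions: two-object scenes built from two colors and two shapes, and two hypothetical hand-written "representations" (`feature_bag` and `conjunctions`). It estimates the mutual information between each representation and the scene's true color-shape pairing.

```python
import math
from collections import Counter
from itertools import permutations

# Toy world (an assumption for illustration): every scene contains each
# colour and each shape exactly once, so only the pairing varies.
COLORS = ["red", "blue"]
SHAPES = ["circle", "square"]

def all_scenes():
    """Enumerate scenes as sorted tuples of (color, shape) pairs."""
    return [tuple(sorted(zip(perm, SHAPES))) for perm in permutations(COLORS)]

def feature_bag(scene):
    """Hypothetical representation that keeps features but drops pairings."""
    colors = tuple(sorted(color for color, _ in scene))
    shapes = tuple(sorted(shape for _, shape in scene))
    return (colors, shapes)

def conjunctions(scene):
    """Hypothetical representation that keeps the color-shape pairings."""
    return scene

def mutual_information(xs, ys):
    """Mutual information in bits between two equally long discrete samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

scenes = all_scenes()   # the binding label is the scene itself
mi_bag = mutual_information([feature_bag(s) for s in scenes], scenes)
mi_conj = mutual_information([conjunctions(s) for s in scenes], scenes)

print(f"feature bag : {mi_bag:.1f} bits of binding information")   # 0.0
print(f"conjunctions: {mi_conj:.1f} bits of binding information")  # 1.0
```

Both scenes contain the same features (red, blue, circle, square), so the feature bag is identical for them and carries zero bits about the pairing, while the conjunction code carries the full one bit. A probe applied to learned representations would have to estimate such a quantity from data rather than by enumeration.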

Problem

Research questions and friction points this paper is trying to address.

binding problem

feature binding

Vision Transformers

visual recognition

object representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

binding problem

Vision Transformers

information-theoretic probing