Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a weakness of existing Concept Bottleneck Models (CBMs) for fine-grained recognition: the image regions a concept head actually attends to often do not match the part its name refers to, which undermines interpretability. The authors propose a Part-Factorized CBM, built on frozen DINOv3 features, that forces each concept to read only from its semantically corresponding part through foreground gating, part queries, a fixed concept-to-part mapping, and learnable 2D Gaussian spatial priors. The spatial prior breaks the permutation symmetry among part queries and aligns concepts with spatial regions without per-image keypoint supervision. On CUB-200-2011, the model attains 88.85% top-1 accuracy, nearly matching the fully supervised baseline (88.95%), and raises pointing accuracy by 16 percentage points, from 36.4% to 52.6%. With a PCA foreground target in place of bounding-box supervision, the model uses no per-image supervision at all and still reaches 88.6% top-1 accuracy at about 70% pointing accuracy.
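The fixed concept-to-part routing can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-part vocabulary, the three example attributes and their part assignments, the function name `route_concepts`, and the use of a plain dot product as the concept head. The paper's actual map covers 312 attributes and is not reproduced here.

```python
# Illustrative sketch only: names, parts, and the linear head are assumptions,
# not the authors' implementation.

# Fixed (non-learned) map from each concept to the one part its name implies.
CONCEPT_TO_PART = {
    "has_bill_shape::hooked": "head",
    "has_wing_color::blue": "wing",
    "has_tail_pattern::striped": "tail",
}


def route_concepts(part_tokens, concept_weights):
    """Score each concept from the single part token it is routed to.

    part_tokens:     dict mapping part name -> feature vector (list of floats)
    concept_weights: dict mapping concept name -> weight vector of the same length
    Returns a dict mapping concept name -> scalar logit.
    """
    logits = {}
    for concept, part in CONCEPT_TO_PART.items():
        token = part_tokens[part]          # the only evidence this concept may read
        weights = concept_weights[concept]
        logits[concept] = sum(t * w for t, w in zip(token, weights))
    return logits
```

Because the map is fixed, changing the wing token cannot change the logit of a bill concept, which is the property the summary describes as reading exclusively from the corresponding part.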

📝 Abstract

Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks, the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction. The method has three components built on a frozen DINOv3 vision transformer. A learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. A set of part queries cross-attends to patch features, and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies. A learnable two-dimensional Gaussian prior, injected additively in log space into the attention logits, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, which requires no per-image keypoint supervision at training or test time. On CUB-200-2011, the spatial-prior model matches a fully supervised baseline (88.85% versus 88.95% top-1) while raising pointing accuracy by 16 points (52.6% versus 36.4%). Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to 2.9%.
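A minimal sketch of how a Gaussian prior might enter the attention logits in log space is given below. It rests on several assumptions that the abstract does not confirm: the Gaussian is axis-aligned, patch centres are expressed in normalized [0, 1] coordinates, and the foreground gate is applied as an additive log-probability term. The function names and the `eps` parameter are hypothetical.

```python
import math


def log_gaussian_prior(grid_h, grid_w, mean, sigma):
    """Unnormalized log-density of an axis-aligned 2D Gaussian at each patch centre.

    mean and sigma are (y, x) pairs in normalized [0, 1] image coordinates.
    Returns a flat, row-major list of length grid_h * grid_w.
    """
    values = []
    for i in range(grid_h):
        for j in range(grid_w):
            y = (i + 0.5) / grid_h
            x = (j + 0.5) / grid_w
            values.append(-0.5 * (((y - mean[0]) / sigma[0]) ** 2
                                  + ((x - mean[1]) / sigma[1]) ** 2))
    return values


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def part_attention(content_logits, log_prior, fg_prob, eps=1e-6):
    """Attention over patches for one part query.

    Adding the log-prior and the log foreground probability to the content
    logits is equivalent to multiplying the unnormalized attention weights by
    the prior and the gate. The additive gate is an assumption of this sketch.
    """
    logits = [c + p + math.log(f + eps)
              for c, p, f in zip(content_logits, log_prior, fg_prob)]
    return softmax(logits)
```

With identical content logits, two part queries whose priors have different means attend to different patches, which is one way to read the claim that the prior breaks the permutation symmetry among part queries. In the paper, the means are initialized from dataset-average keypoint locations and then learned; the sketch takes them as given.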

Problem

Research questions and friction points this paper is trying to address.

Concept Bottleneck Models

Fine-grained Recognition

Spatial Grounding

Part-based Attention

Interpretability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Concept Bottleneck Models

Part-Factorized Attention

Spatial Prior