Causal Interpretation of Sparse Autoencoder Features in Vision

๐Ÿ“… 2025-08-31
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Interpretation of sparse autoencoder (SAE) features in Vision Transformers often erroneously treats highly activated image patches as causal drivers, overlooking global information mixing induced by self-attention. Method: We propose CaFE (Causal Feature Explanation), the first framework that integrates effective receptive field (ERF) estimation with input attribution to precisely localize the image patches causally responsible for SAE feature activation. Contribution/Results: CaFE reveals that SAE features encode semantic structures dependent on co-occurrence of multiple visual partsโ€”correcting misinterpretations arising from activation-based localization alone. Patch insertion and occlusion experiments on CLIP-ViT demonstrate that CaFE significantly outperforms activation-ranking baselines: it achieves more accurate feature recovery and suppression, and yields explanations with higher semantic fidelity and interpretability. This work establishes the first causal explanation paradigm tailored to SAEs in Vision Transformers, advancing the understanding of their internal representations.

Technology Category

Application Category

๐Ÿ“ Abstract
Understanding what sparse auto-encoder (SAE) features in vision transformers truly represent is usually done by inspecting the patches where a feature's activation is highest. However, self-attention mixes information across the entire image, so an activated patch often co-occurs with-but does not cause-the feature's firing. We propose Causal Feature Explanation (CaFE), which leverages Effective Receptive Field (ERF). We consider each activation of an SAE feature to be a target and apply input-attribution methods to identify the image patches that causally drive that activation. Across CLIP-ViT features, ERF maps frequently diverge from naive activation maps, revealing hidden context dependencies (e.g., a "roaring face" feature that requires the co-occurrence of eyes and nose, rather than merely an open mouth). Patch insertion tests confirm that CaFE more effectively recovers or suppresses feature activations than activation-ranked patches. Our results show that CaFE yields more faithful and semantically precise explanations of vision-SAE features, highlighting the risk of misinterpretation when relying solely on activation location.
Problem

Research questions and friction points this paper is trying to address.

Identifies causal drivers of feature activations in vision transformers
Reveals hidden context dependencies beyond activation locations
Addresses misinterpretation risks from naive activation maps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal Feature Explanation method
Leverages Effective Receptive Field
Input-attribution identifies causal patches
๐Ÿ”Ž Similar Papers
No similar papers found.